Talk:NLP Reading Group
Feel free to post ideas for future papers and topics here!
- 1 Plagiarism detection
- 2 Tree automata and transducers
- 3 Reinforcement learning
- 4 Split-and-merge EM
- 5 Boosting
- 6 Collaborative filtering
- 7 Software for graphical models
- 8 Probabilistic linguistics
- 9 Time series analysis (cointegration, leading indicators)
- 10 Sparse learning
- 11 Deep learning
- 12 Enhanced messaging
- 13 space-efficient search, etc.
- 14 Data structures
The bibliography includes links to some online papers.
I doubt many of these approaches are easy to adapt to the multilingual case -- e.g., some use hashing to detect exact substring matches, and they don't seem to use probabilities. However, plagiarism detection does have to face some characteristic problems that show up in translation:
- word substitution (via thesaurus)
- movement of blocks of text rather than independent movement of individual words
So it might be interesting to look for a potentially relevant paper.
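To make the "hashing to detect exact substring matches" point concrete, here is a toy sketch of k-gram fingerprinting in the spirit of plagiarism detectors. All function names are illustrative, not from any particular paper.

```python
def kgram_hashes(text, k=5):
    """Hash every k-character window of the text (positions keyed by hash)."""
    return {hash(text[i:i + k]): i for i in range(len(text) - k + 1)}

def shared_kgrams(doc_a, doc_b, k=5):
    """Return positions in doc_a whose k-grams also occur in doc_b."""
    hashes_b = set(kgram_hashes(doc_b, k))
    return [i for h, i in kgram_hashes(doc_a, k).items() if h in hashes_b]

# Word substitution (fox -> cat) breaks some matches but not others:
print(shared_kgrams("the quick brown fox", "a quick brown cat"))
```

Note how this illustrates the limitation mentioned above: a thesaurus substitution destroys every k-gram that overlaps the substituted word, and no probabilities are involved anywhere.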
Tree automata and transducers
Date: Mon, 12 Aug 2002 16:15:10 +0200
From: Juergen Schmidhuber <email@example.com>
To: firstname.lastname@example.org
Subject: optimal universal reinforcement learner

I'd like to draw your attention to the first optimal, universal reinforcement learner. While traditional RL requires unrealistic Markovian assumptions, the recent AIXI model of Marcus Hutter just needs an environment whose reactions to control actions are sampled from an unknown but computable distribution mu. This includes basically every environment we can reasonably talk about:

M. Hutter. Towards a Universal Theory of Artificial Intelligence based on Algorithmic Probability and Sequential Decisions. Proc. ECML-2001, p. 226-238. http://www.idsia.ch/~marcus/ai/paixi.pdf ftp://ftp.idsia.ch/pub/techrep/IDSIA-14-00.ps.gz

Self-Optimizing and Pareto-Optimal Policies in General Environments based on Bayes-Mixtures. Proc. COLT-2002, 364-379. http://www.idsia.ch/~marcus/ai/selfopt.pdf ftp://ftp.idsia.ch/pub/techrep/IDSIA-04-02.ps.gz

How does AIXI work? An optimal predictor using a universal Bayesmix XI predicts future events including reward. Here XI is just a weighted sum of all distributions nu in a set M. AIXI now simply selects those action sequences that maximize predicted reward. It turns out that this method really is self-optimizing in the following sense: for all nu in the mix, the average value of actions, given the history, asymptotically converges to the optimal value achieved by the unknown policy which knows the true mu in advance! The necessary condition is that M does admit self-optimizing policies. This is also sufficient! And there is no other policy yielding higher or equal value in all environments nu and a strictly higher value in at least one. Interestingly, the right way of treating the temporal horizon is not to discount it exponentially, as done in most traditional RL work, but to let the future horizon grow in proportion to the learner's lifetime so far.
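The Bayes-mixture idea at the core of the email -- XI is a weighted sum of distributions nu, with weights updated by likelihood -- can be gestured at with a toy finite model class. AIXI's mixture is over all computable environments; this sketch uses three Bernoulli coins, purely for illustration.

```python
# Toy finite Bayes mixture predictor: xi(next bit) = sum_nu w_nu * nu(next bit).
biases = [0.1, 0.5, 0.9]              # the model class M (three coins)
weights = [1 / 3, 1 / 3, 1 / 3]       # prior weights w_nu

def predict_one(weights):
    """Mixture probability that the next bit is 1."""
    return sum(w * p for w, p in zip(weights, biases))

def update(weights, bit):
    """Bayes-update the weights after observing one bit."""
    likes = [p if bit == 1 else 1 - p for p in biases]
    posterior = [w * l for w, l in zip(weights, likes)]
    z = sum(posterior)
    return [w / z for w in posterior]

for bit in [1, 1, 1, 0, 1, 1]:        # data mostly from a heads-biased coin
    weights = update(weights, bit)
print(predict_one(weights))           # weight has shifted toward the 0.9 coin
```

The convergence claim in the email is the infinite-class analogue of what happens here: the mixture's predictions approach those of whichever model actually generated the data.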
To quote some AIXI referees: "[...] Clearly fundamental and potentially interesting research direction with practical applications. [...] Great theory. Extends a major theoretical direction that led to practical MDL and MML. This approach may do the same thing (similar thing) wrt to decision theory and reinforcement learning, to name a few." "[...] this could be the foundation of a theory which might inspire AI and MC for years (decades?)."

Juergen Schmidhuber, IDSIA
http://www.idsia.ch/~juergen/unilearn.html
Date: Mon, 4 Nov 2002 12:20:28 -0500
From: "Jason Eisner" <email@example.com>
Subject: paper for a future week

Hinton described this work to me; the paper might be worth looking at sometime in the reading group. http://www.cs.toronto.edu/~hinton/absps/ueda.html

SMEM Algorithm for Mixture Models
Ueda, Nakano, Ghahramani & Hinton

We present a split and merge EM (SMEM) algorithm to overcome the local maxima problem in parameter estimation of finite mixture models. In the case of mixture models, local maxima often involve having too many components of a mixture model in one part of the space and too few in another, widely separated part of the space. To escape from such configurations we repeatedly perform simultaneous split and merge operations using a new criterion for efficiently selecting the split and merge candidates. We apply the proposed algorithm to the training of Gaussian mixtures and mixtures of factor analyzers using synthetic and real data and show the effectiveness of using the split and merge operations to improve the likelihood of both the training data and of held-out test data. We also show the practical usefulness of the proposed algorithm by applying it to image compression and pattern recognition problems.
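A hedged sketch of the merge half of the idea: components whose posterior-responsibility vectors are strongly correlated are probably modeling the same region of the space, so they are natural merge candidates. This only illustrates the flavor of the candidate-selection criterion; see the paper for the exact split and merge criteria.

```python
import numpy as np

def merge_candidates(resp):
    """resp: (n_points, n_components) responsibilities from an E-step.
    Returns component pairs ranked by similarity of their responsibilities."""
    k = resp.shape[1]
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    score = lambda pair: resp[:, pair[0]] @ resp[:, pair[1]]
    return sorted(pairs, key=score, reverse=True)

# Components 0 and 1 split one cluster between them; component 2 is separate:
resp = np.array([[0.5, 0.5, 0.0],
                 [0.6, 0.4, 0.0],
                 [0.0, 0.0, 1.0]])
print(merge_candidates(resp)[0])  # (0, 1): the overlapping pair
```

After merging such a pair, SMEM splits some other component and reruns EM, hoping to land in a better local maximum.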
from 12/2002: Review paper on boosting
from 3/2003: Paper associated with the talk below
This is a talk at Cornell next week:

Title: How to be a Bayesian without believing
Speaker: Yoav Freund

Abstract: Most of the study of Bayesian prediction procedures is premised on some strong assumptions regarding the prior distribution. In particular, an assumption that always needs to be made is that the prior distribution assigns a non-zero probability to the correct model. In practice, however, we often have to restrict the set of models (in the support of the prior) in order to make it feasible to compute the posterior average. As a result, we often can't assume that the correct model is in our set and the standard Bayesian theory cannot be applied.

In this work we show a classification procedure which uses model averaging and can be interpreted as a Bayesian procedure. We show that this procedure has some desirable properties when it is applied to *any* source of IID examples, regardless of whether or not the source distribution is related to the models in the support of the prior. Our main result is that the predictions made by this procedure are very stable with regard to the choice of the random training set.

This stability property has some far-reaching implications on a variety of issues, including Bias-Variance decomposition of classification error, the "curse of dimensionality" and Bagging.

This is joint work with Yishay Mansour and Rob Schapire.
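For reference, the generic model-averaging starting point that the abstract builds on can be sketched in a few lines: weight each model by how well it fit the training data, then average the weighted votes. This is illustrative only, not Freund's actual procedure; the model class and the temperature parameter here are made up for the example.

```python
import math

# Hypothetical model class: threshold classifiers on a single feature.
models = [lambda x, t=t: x > t for t in [0.0, 0.5, 1.0]]
train = [(0.2, False), (0.4, False), (0.7, True), (0.9, True)]

def posterior_weights(eta=2.0):
    """Weight each model by exp(-eta * training errors), normalized."""
    errs = [sum(m(x) != y for x, y in train) for m in models]
    w = [math.exp(-eta * e) for e in errs]
    z = sum(w)
    return [v / z for v in w]

def averaged_predict(x):
    """Predict True iff the weighted vote for True exceeds 1/2."""
    w = posterior_weights()
    return sum(wi for wi, m in zip(w, models) if m(x)) > 0.5

print(averaged_predict(0.8))  # True: most weight is on the t = 0.5 model
```

The talk's point is about what can still be guaranteed for such averaged predictions even when no model in the class is correct.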
Date: Tue, 25 Feb 2003 12:24:24 -0500
From: "Jason Eisner" <firstname.lastname@example.org>
Subject: possible paper - collaborative filtering

Another possible paper for reading group. Collaborative filtering is a smoothing problem: guess which objects (e.g., documents, movies, etc.) a given user will like, based on other knowledge about who likes what. Abstractly, this is very similar to predicting which documents are relevant to a given term, or which nouns are likely to be the objects of a given verb.

----------------------------------------------------------------------
http://research.microsoft.com/~carlk/papers/cfw.htm

CFW: A Collaborative Filtering System Using Posteriors Over Weights of Evidence
Carl M. Kadie, Christopher Meek, & David Heckerman
Microsoft Research

We describe CFW, a computationally efficient algorithm for collaborative filtering that uses posteriors over weights of evidence. In experiments on real data, we show that this method predicts as well or better than other methods in situations where the size of the user query is small. The new approach works particularly well when the user's query contains low frequency (unpopular) items. The approach complements that of dependency networks which perform well when the size of the query is large. Also in this paper, we argue that the use of posteriors over weights of evidence is a natural way to recommend similar items---a task that is somewhat different from the usual collaborative-filtering task.

Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, Edmonton, Alberta, August 2002.
Software for graphical models
See email below. Others that would make the list now are:
- Jeff Bilmes's GMTK
- Hal Daume's Hierarchical Bayes Compiler.
- the Torch 5 machine learning toolbox (front end = Lua scripting language, back end = C++)
- Relevant packages for R?
- Church: A language for generative models - Goodman et al., ICML 2008
Date: 3 Jul 2003 21:38:38 -0400
From: "Jason Eisner" <email@example.com>
I have a suggestion for a line of inquiry when the NLP reading group resumes:
Why don't we look at some of the best, most coherent software for graphical models? We'd start by reading the tutorial documentation together and playing around with the toolkits. Then, once we understand how to use them and have an intuition about how they work, we'd delve into the papers that explain the algorithms they use.
I like the idea of starting with concrete examples and documentation, and then working backward to understand the theory. We could also try this approach for machine learning toolkits such as WEKA. Of course, it only makes sense in areas where the theory and software have both reached a level of elegance.
Here are a couple of packages that have caught my eye recently:
- This is a beautiful system for directed graphical models. It was released only a couple of days ago, and my reaction was that someone had finally done this right. The language looks like ML, but all the variables are random variables, which may be defined in terms of one another via ordinary expressions, and hence are not independent. Evaluation is lazy. You can do inference (ask for the joint or marginal distributions on variables given observations of some other variables), and you can do learning (I think using EM with Dirichlet priors). The language is general enough that you can specify and train PCFGs. (It is possible that this is quite slow -- I don't yet know how things work under the hood.)
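The core capability described above -- define random variables by a generative program, then query marginals given observations -- can be illustrated with brute-force enumeration over a tiny model. This toy has none of the lazy evaluation or EM learning of the real system; names and the example model are made up.

```python
from itertools import product

def worlds():
    """Enumerate (assignment, probability) pairs for a tiny two-coin model."""
    for rain, sprinkler in product([True, False], repeat=2):
        p = (0.2 if rain else 0.8) * (0.1 if sprinkler else 0.9)
        wet = rain or sprinkler          # a variable defined via an expression
        yield {"rain": rain, "sprinkler": sprinkler, "wet": wet}, p

def marginal(var, observe):
    """P(var = True | observations), by exhaustive enumeration."""
    num = den = 0.0
    for w, p in worlds():
        if all(w[k] == v for k, v in observe.items()):
            den += p
            if w[var]:
                num += p
    return num / den

print(marginal("rain", {"wet": True}))   # rain becomes likelier once we see wet
```

Real systems of this kind avoid enumerating all worlds, of course; that is precisely where the interesting inference algorithms come in.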
- "Index to documentation for software that implements flexible Bayesian models based on neural networks, Gaussian processes, mixtures, and Dirichlet diffusion trees, and that demonstrates various Markov chain Monte Carlo methods."
- Neal is very good, and I suspect this is the best software that uses Markov chain sampling (MCMC) to do inference and learning when it might otherwise be infeasible.
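As background on what "Markov chain sampling" means here, this is the basic Metropolis move; Neal's software implements far more sophisticated variants (e.g., Hamiltonian Monte Carlo), so treat this only as a minimal sketch with a standard normal target chosen for illustration.

```python
import math
import random

def log_density(x):
    return -0.5 * x * x              # unnormalized log of N(0, 1)

def metropolis(n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampling from exp(log_density)."""
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0, step)
        # Accept with probability min(1, density ratio); else keep old state.
        if math.log(rng.random()) < log_density(proposal) - log_density(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis(20000)
print(sum(samples) / len(samples))   # near 0 for this symmetric target
```

The appeal for learning is that only an unnormalized density is needed, which is exactly the situation with intractable posteriors.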
- Daphne Koller's stuff
- ... if it has been released. I think that it uses loopy belief propagation, but I don't know if it's been released in a general form.
- (the text below refers to MALLET)
- Andrew McCallum and his students are apparently working on some general information extraction software based on conditional random fields, presumably modeled after the stuff he worked on at the late lamented Whizbang!. It is already at the point where they are using it locally, so we should keep an eye out for release.
- The Lush language and its libraries for machine learning etc.
- There is also a nice-looking short paper by Yoav Shoham that surveys recent attempts to push AI into programming languages, including IBAL.
From: Philip Resnik <firstname.lastname@example.org>
Date: Mon, 8 Sep 2003 10:22:42 -0400 (EDT)

Another thought on possible readings: Bod et al. (2003) have a new book called "Probabilistic Linguistics", with some chapters potentially interesting from the point of view of the linguistics/models interface (i.e., with the goal of linguistic explanation as opposed to task-based performance). Chris Manning has a nice chapter on probabilistic syntax.

Philip
Time series analysis (cointegration, leading indicators)
From: "Jason Eisner" <email@example.com>
Date: Wed, 8 Oct 2003 22:59:40 -0400

An odd but newsworthy source of possible reading: Clive Granger just won (half of) the Nobel Prize in economics, for his work on time-series variables.

His cointegration technique from the 80's tries to determine whether two time-series variables have a statistically significant relationship. (It's not enough for them to follow the same overall trend. To get significance, I suspect, smaller fluctuations also have to be linked more often than expected by chance.)

Reading about this work suggested several NLP analogies to me. First, I wonder whether this kind of analysis could be used on clustering and k-nearest neighbor methods, especially those based on distributional similarity (e.g., Lillian Lee's work). In such methods, we are assuming that past similarity of p(. | x) and p(. | y) will predict their future similarity -- just as in economic prediction. Such an assumption should be stronger if the hills and valleys of the two distributions are well matched.

Granger also looked at the causal relationships among time series, i.e., which variables are leading indicators of the others. This seems analogous to transformation models -- i.e., is NP -> Det Adj N derived from NP -> Det N or vice versa?

Of course, we are less interested in these yes/no questions (are x and y significantly related, does x lead y, etc.) than in the problem of actually predicting y from x. But that's also true of economists, and I'm sure they've addressed those issues by now. There are many textbook treatments of these topics. Cointegration and analysis of leading indicators are widely used now by economic forecasters.
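The leading-indicator idea can be sketched numerically: x "Granger-causes" y if adding lagged x to a regression on y's own lags reduces prediction error. This toy (synthetic data, one lag, no significance test; real usage compares the two fits with an F-test) is just to make the quantity concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.normal()   # y trails x by one step

def residual_sse(targets, predictors):
    """Sum of squared residuals from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(predictors, targets, rcond=None)
    return ((targets - predictors @ beta) ** 2).sum()

own = np.column_stack([y[:-1]])                  # y's own lag only
both = np.column_stack([y[:-1], x[:-1]])         # plus lagged x
print(residual_sse(y[1:], own) > residual_sse(y[1:], both))  # True: x helps
```

The NLP analogy in the email would replace x and y with, say, time series of two words' distributional statistics.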
jason

----------------------------------------------------------------------
http://news.ft.com/servlet/ContentServer?pagename=FT.com/StoryFT/FullStory&c=StoryFT&cid=1059480455082

Designers of forecasting toolkit win Nobel prize
By Alan Beattie, International Economy Correspondent
Published: October 9 2003 5:00 | Last Updated: October 9 2003 5:00

Two academics who designed essential parts of the modern statistical toolkit for analysing financial markets and macroeconomic forecasting have won this year's Nobel prize for economics.

Clive Granger, a British economist now at the University of California in San Diego, and Robert Engle, an American at New York University, made critical breakthroughs in techniques to handle "time series" variables - data recorded at regular intervals. They will share the prize of SKr10m ($1.3m).

Prof Granger's major discovery was in the invention of "cointegration", a means of analysing the relationship between two variables, such as wealth and consumption, both of which had a trend over time. Previously, economists had often claimed a correlation between two or more such variables which turned out to be spurious, being driven simply by the fact that both followed a trend - an error pointed out by Prof Granger in the mid-1970s.

In subsequent work over a decade later, Prof Granger established that such "non-stationary" variables could be combined in ways which allowed accurate statistical inferences to be drawn. He also developed the "Granger causality test" to estimate which of several variables led movements in the others. Because many macroeconomic variables such as income, consumption, the price level and gross domestic product have persistent trends, the techniques that Prof Granger developed are now routinely used by economic forecasters at central banks, at finance ministries, in academia and in the financial markets.
"Suddenly, after Clive's breakthrough, we had apparatus that allowed us to disentangle those things that merely moved together over time from those that were actually related to each other," said Kenneth Wallis, professor of economics at Warwick University in the UK and former editor of the academic journal Econometrica, which published pathbreaking articles by both men.

Although Prof Engle did some work with Prof Granger, his award was for separate research on variables whose volatility varies over time. It is commonly used to analyse prices in financial markets. He told reporters yesterday he was delighted with the prize, which reflected how widely the techniques he developed had become used.

Traditional tools of analysis assumed that volatility was constant. The techniques Prof Engle developed, particularly autoregressive conditional heteroskedasticity (ARCH) models which predicted volatility would depend on its own behaviour in the past, are now widely employed by investors and financial institutions assessing the risk of portfolios.

Prof Wallis said that the two academics had created a highly influential centre of excellence at the University of California at San Diego, though Prof Engle was tempted away to NYU.
ICML/UAI/COLT 2008 is holding a workshop on Sparse Optimization and Variable Selection, whose CFP has many references to techniques, as well as links to past workshops.
Jason 23:18, 3 April 2008 (EDT)
Just ran across an interesting Deep Learning Workshop from NIPS 2007.
By "deep learning," they mean learning neural networks with many layers, or other systems that build up high-level representations one step at a time, e.g., for vision.
The website above has videos of all talks, plus links to quite a number of background papers from 2006 and 2007. This looks like an interesting future topic. We could even watch videos in reading group! Jason 22:34, 3 April 2008 (EDT)
See also the CALO project.
space-efficient search, etc.
We should follow up on some of the references noted at 5/15/08.
Also, Kevin Murphy's thesis has other good stuff in it, all about HMMs and their many generalizations. At some point, we should probably read chapter 5 on stochastic methods for approximate inference in DBNs, notably Rao-Blackwellized particle filtering.
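As shared background for that chapter, here is a plain bootstrap particle filter on a 1-D random-walk state-space model (made-up toy model; the Rao-Blackwellized variant in the thesis filters part of the state exactly and samples only the rest).

```python
import math
import random

def particle_filter(observations, n_particles=1000, seed=0):
    """Track the posterior mean of a 1-D random-walk state from noisy obs."""
    rng = random.Random(seed)
    particles = [0.0] * n_particles
    means = []
    for obs in observations:
        # 1. Propagate each particle through the random-walk transition.
        particles = [p + rng.gauss(0, 1) for p in particles]
        # 2. Weight by the Gaussian observation likelihood (sigma = 1).
        weights = [math.exp(-0.5 * (obs - p) ** 2) for p in particles]
        # 3. Resample particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        means.append(sum(particles) / n_particles)
    return means

means = particle_filter([0.0, 1.0, 2.0, 3.0])
print(means[-1])   # tracks the rising observation sequence
```

Rao-Blackwellization then replaces the sampled representation of any analytically tractable sub-state (e.g., a conditionally linear-Gaussian part) with its exact conditional distribution, reducing variance.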
Simon J. Puglisi, W. F. Smyth, and Andrew H. Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys 39(2), July 2007.
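For anyone new to the topic, the naive O(n^2 log n) baseline that all the algorithms in that survey improve on is just sorting the suffixes directly:

```python
def naive_suffix_array(s):
    """Indices of all suffixes of s, in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

The surveyed algorithms get this down to linear or near-linear time and, relevant to the space-efficient-search topic above, with far less working memory.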