# TMPL: Topic Modeling in PL

## Tracking the flow of ideas through the programming languages literature

Check out our SNAPL paper, *Tracking the flow of ideas through the
programming languages literature*. Here we offer larger-scale PDFs of
some of the figures from our paper.

Graphs of LOESS fits of topic proportions for each of the k=20
topics learned from abstracts from ICFP, OOPSLA, PLDI, and POPL.

Graphs of LOESS fits of topic proportions for each of the k=20
topics learned from the full text documents from PLDI and POPL.
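To give a sense of what a LOESS fit does to a per-year topic proportion series, here is a minimal pure-Python sketch of locally weighted linear regression with tricube weights. It is an illustration only, not the code used to produce the figures: the function name `loess`, the `frac` parameter, and the sample data are all assumptions.

```python
import math

def loess(xs, ys, frac=0.75):
    """Smooth ys over xs with local linear fits and tricube weights.

    frac is the share of points used in each local neighbourhood
    (a hypothetical default, not taken from the paper).
    """
    n = len(xs)
    k = max(2, math.ceil(frac * n))
    fitted = []
    for x0 in xs:
        # Bandwidth: distance to the k-th nearest point.
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        swx = sum(wi * x for wi, x in zip(w, xs))
        swy = sum(wi * y for wi, y in zip(w, ys))
        swxx = sum(wi * x * x for wi, x in zip(w, xs))
        swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
        denom = sw * swxx - swx ** 2
        if abs(denom) < 1e-12:
            # Degenerate neighbourhood: fall back to the weighted mean.
            fitted.append(swy / sw)
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * x0)
    return fitted

# Hypothetical yearly proportions of one topic.
years = [2000, 2001, 2002, 2003, 2004, 2005]
proportions = [0.04, 0.07, 0.05, 0.09, 0.08, 0.11]
smoothed = loess(years, proportions)
```

A smaller `frac` follows the yearly points more closely, while a larger one gives a flatter trend line.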

Graphs comparing the Euclidean distance between four papers'
citations and the same number of randomly selected papers. CDRS
is *Concurrent data representation synthesis*, PCC is
*Proof-carrying code*, SEMC is *Space-efficient manifest contracts*,
and TAL is *From System F to typed assembly language*.

This graph shows the same data as the previous one, but uses
symmetrized Kullback–Leibler divergence as the distance metric
instead of Euclidean distance. Symmetrized KL divergence separates
related and unrelated work even more dramatically. The related work
search has been updated to use this metric instead of Euclidean
distance.
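The two distance measures can be compared on a pair of topic vectors with a few lines of Python. This is a sketch under assumptions: the page does not say which symmetrization is used, so the sum D(p‖q) + D(q‖p) is chosen here, and the small `eps` smoothing term and the three example vectors are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length topic vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kl(p, q, eps=1e-12):
    """KL divergence D(p || q) in nats; eps guards against log(0)."""
    return sum(a * math.log((a + eps) / (b + eps))
               for a, b in zip(p, q) if a > 0)

def symmetrized_kl(p, q):
    """One common symmetrization (an assumption here): D(p||q) + D(q||p)."""
    return kl(p, q) + kl(q, p)

# Hypothetical 3-topic mixtures: a paper, a related paper, an unrelated one.
paper = [0.70, 0.20, 0.10]
related = [0.60, 0.30, 0.10]
unrelated = [0.05, 0.15, 0.80]

for name, other in [("related", related), ("unrelated", unrelated)]:
    print(name, euclidean(paper, other), symmetrized_kl(paper, other))
```

Both measures rank the related paper closer in this toy example. KL divergence treats the vectors as probability distributions, so it penalizes mass placed where the other paper has almost none.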

Scatterplots indicating the overlap between the top 50 papers in
k=20 topics learned over abstracts and full text for PLDI and
POPL.
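One simple way to quantify this kind of overlap is to take the top-ranked papers for a topic in each model and count how many appear in both lists. The sketch below assumes per-paper topic weights stored in dictionaries. The names `top_k` and `overlap` and the paper ids are hypothetical, and how the scatterplots pair abstract topics with full-text topics is not specified here.

```python
def top_k(weights, k):
    """Return the set of paper ids with the k largest weights in one topic."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return set(ranked[:k])

def overlap(abstract_weights, fulltext_weights, k=50):
    """Count papers that appear in both top-k lists."""
    return len(top_k(abstract_weights, k) & top_k(fulltext_weights, k))

# Hypothetical per-paper weights for one abstract topic and one full-text topic.
abstract_topic = {"p1": 0.9, "p2": 0.7, "p3": 0.2, "p4": 0.1}
fulltext_topic = {"p1": 0.8, "p3": 0.6, "p2": 0.3, "p4": 0.05}

shared = overlap(abstract_topic, fulltext_topic, k=2)
print(shared)
```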

Log likelihood of words by topic for k=20 topics learned over
abstracts. The x-axis is the rank of each word, i.e., the left-most
point in a topic's graph is the most likely word for *that
topic*; word #1 is different for each topic. The y-axis is the
log-likelihood of each word.
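The ranking behind these axes can be sketched as follows: each topic's words are sorted by probability independently, so rank 1 refers to a different word in each topic. The topic names, words, and probabilities below are made up for illustration.

```python
import math

def ranked_log_likelihoods(word_probs):
    """Sort one topic's words by probability; return (rank, word, log p)."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, word, math.log(p))
            for rank, (word, p) in enumerate(ranked, start=1)]

# Hypothetical word distributions for two topics.
topics = {
    "types": {"type": 0.5, "calculus": 0.3, "heap": 0.2},
    "memory": {"heap": 0.6, "garbage": 0.3, "type": 0.1},
}

curves = {name: ranked_log_likelihoods(probs) for name, probs in topics.items()}
for name, curve in curves.items():
    print(name, curve[0][1])  # the rank-1 word differs per topic
```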

Topics over time for ICFP, k=20 topics learned over abstracts.

Aggregate weight, by topic. The x-axis shows the k=20 abstract topics
for just ICFP; the y-axis is the total weight in that topic over all
years, colored by conference. Note that topic weight is *not*
evenly distributed across all topics.
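Aggregate weight of this kind can be computed by summing each paper's topic weights, grouped by topic and conference. The following is a minimal sketch with a hypothetical two-topic data set; the function name `aggregate_weight` and the input layout are assumptions.

```python
from collections import defaultdict

def aggregate_weight(papers):
    """Sum topic weights over all papers, keyed by (topic index, conference)."""
    totals = defaultdict(float)
    for conference, topic_weights in papers:
        for topic, weight in enumerate(topic_weights):
            totals[(topic, conference)] += weight
    return dict(totals)

# Hypothetical papers: (conference, per-topic weights summing to 1).
papers = [
    ("ICFP", [0.8, 0.2]),
    ("ICFP", [0.6, 0.4]),
    ("POPL", [0.1, 0.9]),
]

totals = aggregate_weight(papers)
for (topic, conference), weight in sorted(totals.items()):
    print(topic, conference, round(weight, 3))
```

Stacking the per-conference totals for each topic gives one bar per topic, which makes uneven topic weights easy to see.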

Aggregate weight, like the above graph, but for full text with k=20
on POPL and PLDI.