
TMPL: Topic Modeling in PL

Tracking the flow of ideas through the programming languages literature

Michael Greenberg, Kathleen Fisher, and David Walker

Check out our SNAPL paper, "Tracking the flow of ideas through the programming languages literature". Here we offer larger-scale PDFs of some of the figures from the paper.

Graphs of LOESS fits of topic proportions for each of the k=20 topics learned from abstracts from ICFP, OOPSLA, PLDI, and POPL.

Graphs of LOESS fits of topic proportions for each of the k=20 topics learned from the full text documents from PLDI and POPL.
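For readers unfamiliar with LOESS, the following is a minimal sketch of the idea, not the code used to produce the graphs. It fits one tricube-weighted local line per year and skips the robustness iterations of full LOESS. The function name `loess`, the `frac` parameter, and the sample proportions are all assumptions made for illustration.

```python
def loess(xs, ys, frac=0.5):
    """Simplified LOESS: one tricube-weighted local line per point.

    Hypothetical helper for illustration; no robustness iterations.
    """
    n = len(xs)
    # Number of neighbors that define the local bandwidth.
    span = min(n, max(2, round(frac * n)))
    fitted = []
    for x0 in xs:
        dists = [abs(x - x0) for x in xs]
        # Bandwidth: distance to the span-th nearest point (guarded against zero).
        h = max(sorted(dists)[span - 1], 1e-12)
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        weights = [(1.0 - min(d / h, 1.0) ** 3) ** 3 for d in dists]
        total = sum(weights)
        mean_x = sum(w * x for w, x in zip(weights, xs)) / total
        mean_y = sum(w * y for w, y in zip(weights, ys)) / total
        sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mean_x) * (y - mean_y)
                  for w, x, y in zip(weights, xs, ys))
        # Weighted least-squares line, evaluated at x0.
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_y + slope * (x0 - mean_x))
    return fitted


# Made-up yearly proportions for a single topic.
years = list(range(2000, 2010))
proportions = [0.05, 0.07, 0.04, 0.09, 0.08, 0.12, 0.10, 0.15, 0.13, 0.16]
smoothed = loess(years, proportions, frac=0.5)
```

A larger `frac` gives a smoother curve; a smaller one follows year-to-year variation more closely.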

Graphs comparing the Euclidean distance between each of four papers and the papers it cites against the distance between that paper and the same number of randomly selected papers. CDRS is "Concurrent data representation synthesis", PCC is "Proof-carrying code", SEMC is "Space-efficient manifest contracts", and TAL is "From System F to typed assembly language".

This graph shows the same data as the previous one, but uses symmetrized Kullback–Leibler (KL) divergence as the distance metric instead of Euclidean distance. Symmetrized KL divergence separates related and unrelated work even more dramatically. The related work search has been updated to use this metric instead of Euclidean distance.
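The two distance metrics can be sketched as follows, treating each paper as a vector of topic proportions. This is an illustrative sketch under stated assumptions: symmetrized KL is taken as the sum KL(p‖q) + KL(q‖p) (some definitions halve it), the `eps` smoothing term is our addition to avoid division by zero, and the three topic vectors are made up.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two topic-proportion vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q); eps is an assumed smoothing constant."""
    return sum(a * math.log((a + eps) / (b + eps))
               for a, b in zip(p, q) if a > 0)


def symmetrized_kl(p, q, eps=1e-12):
    """Symmetrized KL, here defined as KL(p || q) + KL(q || p)."""
    return kl(p, q, eps) + kl(q, p, eps)


# Made-up topic vectors (k=3 for brevity).
paper = [0.7, 0.2, 0.1]
cited = [0.6, 0.3, 0.1]
unrelated = [0.05, 0.15, 0.8]

print(euclidean(paper, cited), euclidean(paper, unrelated))
print(symmetrized_kl(paper, cited), symmetrized_kl(paper, unrelated))
```

Under both metrics the cited paper in this toy example is closer than the unrelated one; KL divergence penalizes topics where one paper has substantial weight and the other has almost none.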

Scatterplots indicating the overlap between the top 50 papers in k=20 topics learned over abstracts and full text for PLDI and POPL.
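One way to compute such an overlap, sketched here under assumptions, is to take the top-n papers by weight in each abstract topic and each full-text topic and count the papers the two sets share. The function names, the dictionary layout (paper id mapped to a list of topic weights), and the toy data are hypothetical; the scatterplots may have been built differently.

```python
def top_papers(doc_topics, topic, n):
    """Ids of the n papers with the highest weight in the given topic."""
    ranked = sorted(doc_topics, key=lambda pid: doc_topics[pid][topic],
                    reverse=True)
    return set(ranked[:n])


def overlap_matrix(abstract_topics, fulltext_topics, k_abs, k_full, n=50):
    """matrix[i][j] = papers shared by abstract topic i and full-text topic j."""
    tops_abs = [top_papers(abstract_topics, t, n) for t in range(k_abs)]
    tops_full = [top_papers(fulltext_topics, t, n) for t in range(k_full)]
    return [[len(a & f) for f in tops_full] for a in tops_abs]


# Toy data: four papers, two topics per model, top 2 instead of top 50.
abstract_topics = {"p1": [0.9, 0.1], "p2": [0.8, 0.2],
                   "p3": [0.1, 0.9], "p4": [0.2, 0.8]}
fulltext_topics = {"p1": [0.2, 0.8], "p2": [0.1, 0.9],
                   "p3": [0.7, 0.3], "p4": [0.6, 0.4]}
matrix = overlap_matrix(abstract_topics, fulltext_topics, 2, 2, n=2)
```

In the toy data, abstract topic 0 and full-text topic 1 share both of their top papers, which suggests the two models found the same theme under different topic numbers.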

Log-likelihood of words by topic for k=20 topics learned over abstracts. The x-axis is the rank of each word, i.e., the left-most point in a topic's graph is the most likely word for that topic; word #1 is different for each topic. The y-axis is the log-likelihood of each word.
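The rank and log-likelihood axes can be derived from a topic's word distribution as in this small sketch. The function name `ranked_log_likelihoods` and the toy topic are assumptions made for illustration.

```python
import math


def ranked_log_likelihoods(word_probs):
    """Return (rank, word, log-likelihood) triples, most likely word first."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, word, math.log(prob))
            for rank, (word, prob) in enumerate(ranked, start=1)]


# Made-up word distribution for a single topic.
toy_topic = {"proof": 0.3, "type": 0.5, "heap": 0.2}
for rank, word, loglik in ranked_log_likelihoods(toy_topic):
    print(rank, word, round(loglik, 3))
```

Plotting rank against log-likelihood for each topic shows how quickly probability mass falls off after the topic's most likely words.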

Topics over time for ICFP, k=20 topics learned over abstracts.

Aggregate weight, by topic. The x-axis shows the k=20 topics learned over abstracts for just ICFP; the y-axis is the total weight in that topic over all years, colored by conference. Note that weight is not evenly distributed across topics.

Aggregate weight, as in the graph above, but for full text with k=20 on POPL and PLDI.
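Aggregate weight can be computed by summing each document's topic proportions per topic, grouped by conference. The sketch below assumes each document is a (conference, topic weights) pair; the function name `aggregate_weight` and the toy documents are hypothetical.

```python
from collections import defaultdict


def aggregate_weight(docs, k):
    """Sum topic weights over all documents, grouped by conference."""
    totals = defaultdict(lambda: [0.0] * k)
    for conference, weights in docs:
        for topic, weight in enumerate(weights):
            totals[conference][topic] += weight
    return dict(totals)


# Made-up documents with k=2 topics for brevity.
docs = [
    ("POPL", [0.5, 0.5]),
    ("POPL", [0.25, 0.75]),
    ("PLDI", [0.75, 0.25]),
]
totals = aggregate_weight(docs, k=2)
```

Stacking the per-conference totals for each topic gives a bar chart of total topic weight, colored by conference.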