new-papers - Sachin Kukreja

The original new-papers used TF-IDF vectors and cosine similarity to find related research papers from an abstract. It worked, but it had a ceiling - TF-IDF is fundamentally a word-frequency technique. It does not understand meaning, only overlap. Two papers on the same topic written with different vocabulary could score poorly despite being closely related.
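The vocabulary problem is easy to reproduce. The sketch below is a minimal standard-library TF-IDF with cosine similarity, not the code from the main branch; the whitespace tokeniser, the smoothed IDF formula, and the three toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (term frequency x smoothed IDF)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "neural networks for image classification",
    "deep learning models that recognise pictures",  # same topic, different words
    "neural networks for speech recognition",        # different topic, shared words
]
vecs = tfidf_vectors(docs)
sim_paraphrase = cosine(vecs[0], vecs[1])  # 0.0: the two sentences share no tokens
sim_overlap = cosine(vecs[0], vecs[2])     # positive: three tokens are shared
```

The paraphrase scores zero while the off-topic sentence scores higher, which is the ceiling described above.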

This branch replaces that approach with allenai/specter - a transformer model from the Allen Institute for AI, pre-trained specifically on scientific papers using citation graphs as the training signal. Papers that cite each other are treated as semantically related, which means the model learns a richer, domain-aware notion of similarity than general-purpose language models.

What changed

v1 · main branch

TF-IDF + cosine similarity

Bag-of-words representation. Fast, lightweight, but blind to semantics and paraphrasing.

v2 · allenai-specter

SPECTER embeddings

Dense transformer embeddings trained on citation graphs. Captures meaning, not just word overlap.

The interface stays the same - paste an abstract, get back a ranked list of related papers and predicted categories. Under the hood, each abstract is now encoded into a 768-dimensional dense vector using SPECTER before similarity search. Two abstracts that describe overlapping concepts in different terminology can therefore still land close together, which is the case TF-IDF handled worst.
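A minimal sketch of the encoding step follows, assuming the Hugging Face transformers and torch packages and the public allenai/specter checkpoint. The function names, batch size, and the choice of the first-token ([CLS]) vector as the paper embedding are illustrative assumptions, not necessarily what this branch does.

```python
import numpy as np

MODEL_NAME = "allenai/specter"  # the model named in the text


def embed_abstracts(abstracts, batch_size=8):
    """Encode a list of abstracts into a (len(abstracts), hidden_size) array."""
    # Imported lazily so the similarity helper below works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)
    model.eval()

    chunks = []
    with torch.no_grad():
        for start in range(0, len(abstracts), batch_size):
            batch = abstracts[start:start + batch_size]
            inputs = tokenizer(
                batch, padding=True, truncation=True,
                max_length=512, return_tensors="pt",
            )
            outputs = model(**inputs)
            # Use the first-token ([CLS]) hidden state as the paper embedding.
            chunks.append(outputs.last_hidden_state[:, 0, :].numpy())
    return np.vstack(chunks)


def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Calling embed_abstracts(["first abstract", "second abstract"]) downloads the checkpoint on first use and returns one row per abstract; cosine_similarity can then compare any two rows.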

Why SPECTER specifically

Most transformer models are pre-trained on general text such as web pages and books. SPECTER is additionally trained on the citation graph of scientific literature, so that a paper sits closer in embedding space to the papers it cites than to papers it does not. That prior makes it better suited to academic paper recommendation than a general-purpose model like BERT.
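The citation signal can be pictured as a triplet objective: a query paper, a paper it cites (positive), and a paper it does not cite (negative). The sketch below shows the shape of such a loss on toy vectors; the function name, the margin value, and the vectors are assumptions for illustration, and this is not SPECTER's training code.

```python
import numpy as np


def triplet_margin_loss(query, positive, negative, margin=1.0):
    """Zero once the cited paper is closer to the query than the uncited one by `margin`."""
    d_pos = np.linalg.norm(query - positive)  # distance to a cited paper
    d_neg = np.linalg.norm(query - negative)  # distance to an uncited paper
    return float(max(d_pos - d_neg + margin, 0.0))


query = np.array([0.0, 0.0])
cited = np.array([0.0, 1.0])    # distance 1 from the query
uncited = np.array([3.0, 4.0])  # distance 5 from the query

loss_good = triplet_margin_loss(query, cited, uncited)  # cited paper already closer
loss_bad = triplet_margin_loss(query, uncited, cited)   # roles swapped: penalised
```

Minimising a loss of this form pulls cited papers towards the query and pushes uncited ones away, which is what places related papers near each other in embedding space.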

Limitations and next steps

The tradeoff is compute. SPECTER embeddings need to be precomputed for the entire dataset and stored, whereas TF-IDF matrices are cheap to rebuild. For 100,000 arXiv papers this is manageable offline, but it shifts the architecture towards a precompute-then-serve pattern rather than on-the-fly vectorisation.
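To put a number on "manageable": 100,000 papers x 768 dimensions x 4 bytes (float32) is about 307 MB of raw embeddings. The sketch below illustrates the precompute-then-serve pattern with NumPy; the function names, the .npy file format, and the brute-force search are assumptions for illustration, not the storage layout this branch uses.

```python
import numpy as np


def embedding_storage_mb(num_papers, dim=768, bytes_per_value=4):
    """Size of a dense embedding matrix in megabytes (10**6 bytes)."""
    return num_papers * dim * bytes_per_value / 1e6


def save_index(embeddings, path):
    """Offline step: L2-normalise each row once and write the matrix to a .npy file."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    np.save(path, (embeddings / norms).astype(np.float32))


def load_index(path):
    """Serve-time step: memory-map the stored matrix instead of re-encoding papers."""
    return np.load(path, mmap_mode="r")


def search(index, query_vec, top_k=5):
    """Return (row, score) pairs; rows are unit length, so a dot product is the cosine."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q
    top = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in top]
```

At serve time only the pasted abstract needs to be encoded; the stored matrix is loaded once and each query is a single matrix-vector product.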

Looking ahead, the natural next steps are to extend beyond arXiv, to encode paper titles alongside abstracts (the title-plus-abstract input SPECTER was trained on), and to add user feedback to refine recommendations over time. SPECTER has since been succeeded by SPECTER2, which adds task-specific adapter modules - that would be a worthwhile upgrade for a production version of this tool.
