Jason Portenoy, PhD

Senior Data Engineer, User Engagement and Outreach

OurResearch

Projects

Automated literature review

Autoreview is a framework for building and evaluating systems that automatically select relevant publications for literature reviews, starting from small sets of seed papers. These automated methods could help researchers save time and effort when keeping up with relevant literature, and could surface papers that more manual methods may miss. I show that this approach can recommend relevant literature, and that it can also be used to systematically compare the different features used in the recommendations.

SciSight / Bridger

Scientific silos can hinder innovation. These information "filter bubbles" and the growing challenge of information overload limit awareness across the literature, making it difficult to keep track of even narrow areas of interest, let alone discover new ones. SciSight/Bridger is a project focused on facilitating discovery of scholars and their work by locating commonalities and contrasts between scientists. This work was published at the CHI 2022 conference. Video of the presentation can be found here. I am continuing to develop these methods so they can be deployed in a production setting.

GrantExplorer

GrantExplorer is a free, open-source tool for examining the phrases that appear in grants funded by U.S. federal agencies. It covers more than half a million grants from the National Science Foundation (NSF), National Institutes of Health (NIH), and Department of Defense (DoD). The tool uses React, D3, and FastAPI for interactive visualizations, and Elasticsearch and Gensim language models to intelligently assist with keyword queries. The source code is available on GitHub.

Community detection on large networks

Community detection, or clustering, algorithms can reveal patterns and relationships in complex citation networks. Many such algorithms are available, representing several different approaches to the problem. They are often computationally expensive, and as the number of publications continues to grow, the challenge is to adapt them to very large networks. To address this, I have developed new methods to cluster very large citation networks. Using several parallel processing techniques, I am able to perform clustering on networks with hundreds of millions of publications and over 1 billion citation links. The code is available on GitHub.

Visualizing scholarly influence

The scholarly literature is a vast store of formalized human knowledge, interconnected by citations between publications. Looking at these citations is one way to measure the influence of scholarly research. Scholar.eigenfactor.org is a collection of tools I built that use citations to measure and visualize the influence of collections of papers. These collections can represent, for example, individual scholars, journals, academic departments, or fields of study.

Mapping Misinformation Research

I applied tools for visualizing and analyzing the research publications in the emerging fields of Misinformation and Science Communication. Starting with sets of seed papers representing some key recent research in these fields, I provided tools to examine author relationships, influence from outside fields, and related research identified by machine learning. These tools are collected at misinformationresearch.org, as part of a report for the National Academy of Sciences.

Mathematical Jargon

We analyze the mathematical language used in hundreds of thousands of scientific papers, comparing the use of math across different disciplines. By comparing the distributions of mathematical symbols and terms across fields, we quantify the "jargon barriers" between these fields: the difficulty any two fields might have communicating, based on how different their use of mathematical language is. We find that characterizing fields by their use of mathematical language causes them to group in intuitive ways, and we explore how this approach could be used for recommendations in the scholarly literature and for helping to bridge knowledge gaps in science.
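
As a rough illustration of what comparing symbol distributions between two fields can look like, here is a minimal sketch. It assumes Jensen-Shannon divergence as the distance measure, which is my choice for this example and not necessarily the measure used in the project. The function names and the toy symbol lists are hypothetical.

```python
import math
from collections import Counter


def distribution(tokens):
    """Turn a list of symbols/terms into a relative-frequency distribution."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    Returns 0.0 for identical distributions and 1.0 when the two
    distributions share no symbols at all.
    """
    vocab = set(p) | set(q)
    # Mixture distribution: the average of p and q over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl_to_mixture(a):
        # Kullback-Leibler divergence from a to the mixture m.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Toy data (hypothetical): LaTeX symbols extracted from papers in two fields.
physics = distribution(["\\hbar", "\\psi", "\\nabla", "\\psi", "\\partial"])
economics = distribution(["\\beta", "\\epsilon", "\\sum", "\\beta", "\\partial"])

barrier = js_divergence(physics, economics)
print(f"jargon barrier (JS divergence): {barrier:.3f}")
```

A higher value means the two fields draw on more dissimilar mathematical vocabularies. The measure is symmetric, so the barrier between two fields is the same in either direction.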

Predictors of permanent housing for homeless families

As a Data Science for Social Good summer fellow at the University of Washington's eScience Institute, I worked on a team collaborating with the Bill and Melinda Gates Foundation and other organizations to help understand and address the problem of family homelessness in western Washington state. Our contributions included analyzing data, building models, and creating interactive visualizations.

Commitments in written communication

As an intern at Microsoft Research, I worked on a project to help Microsoft's personal digital assistant, Cortana, identify when people make commitments in their outgoing emails, and understand what kinds of commitments they make.