Jason Portenoy, PhD

NYC-based data engineer/data scientist

Projects

OpenAlex
OpenAlex is the world’s largest free and open Scientific Knowledge Graph (SKG), created by the nonprofit startup OurResearch. Representing the entire global research ecosystem, it comprises more than 250 million scholarly works, billions of links between them, and terabytes of linked metadata. As a Senior Data Engineer at OurResearch, I focused on understanding the strengths and limitations of the OpenAlex data, and on helping our users do the same.
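For a sense of what exploring OpenAlex looks like from the user side, here is a minimal sketch of querying the public REST API with Python; the search term and filter values below are purely illustrative.

```python
import requests

# Minimal sketch: fetch a small page of works from the OpenAlex REST API.
# The search term and filter below are illustrative values only.
response = requests.get(
    "https://api.openalex.org/works",
    params={
        "search": "science of science",
        "filter": "from_publication_date:2023-01-01",
        "per-page": 5,
        "mailto": "you@example.com",  # identifies you for the polite pool
    },
    timeout=30,
)
response.raise_for_status()
for work in response.json()["results"]:
    print(work["display_name"], "-", work.get("cited_by_count", 0), "citations")
```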
Automated literature review
Autoreview is a framework for building and evaluating systems that automatically select relevant publications for literature reviews, starting from small sets of seed papers. These automated methods have the potential to help researchers save time and effort when keeping up with relevant literature, as well as to surface papers that more manual methods may miss. I show that this approach can recommend relevant literature, and that it can also be used to systematically compare the different features used to make those recommendations.

Portenoy, J., & West, J. D. (2020). Constructing and evaluating automated literature review systems. Scientometrics. https://doi.org/10.1007/s11192-020-03490-w
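A rough sketch of the core idea in Python (a simplification, not the published pipeline; the featurize callable and the two-hop candidate expansion are assumptions made for illustration):

```python
# Simplified sketch of the autoreview idea: split the seed papers, expand a
# candidate pool by following citations from one half, and train a classifier
# to pick out the held-out half from among the candidates.
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

def autoreview_sketch(seed_ids, citation_graph, featurize, n_seed=50):
    """citation_graph: dict paper_id -> set of citing/cited paper ids.
    featurize: assumed callable mapping a paper_id to a feature vector."""
    random.shuffle(seed_ids)
    seeds, targets = seed_ids[:n_seed], set(seed_ids[n_seed:])

    # Candidate pool: papers within two citation hops of the seed papers.
    candidates = set()
    for pid in seeds:
        for nbr in citation_graph.get(pid, ()):
            candidates.add(nbr)
            candidates.update(citation_graph.get(nbr, ()))
    candidates -= set(seeds)

    # Label candidates by whether they are held-out seed papers, then train a
    # classifier to rank candidates by predicted relevance.
    X = [featurize(pid) for pid in candidates]
    y = [int(pid in targets) for pid in candidates]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y)
    clf = RandomForestClassifier(n_estimators=200).fit(X_tr, y_tr)

    # Average precision on the held-out candidates: how well the ranking
    # recovers the target papers.
    return average_precision_score(y_te, clf.predict_proba(X_te)[:, 1])
```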
SciSight / Bridger
Scientific silos can hinder innovation. These information "filter bubbles" and the growing challenge of information overload limit awareness across the literature, making it difficult to keep track of even narrow areas of interest, let alone discover new ones. SciSight/Bridger is a project focused on facilitating the discovery of scholars and their work by locating commonalities and contrasts between scientists. This work was published at CHI 2022. A video of the presentation can be found here.

Portenoy, J., Radensky, M., West, J. D., Horvitz, E., Weld, D. S., & Hope, T. (2022). Bursting Scientific Filter Bubbles: Boosting Innovation via Novel Author Discovery. CHI 2022. https://doi.org/10.1145/3491102.3501905
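One way to picture the commonality-and-contrast idea is the sketch below (purely illustrative; the per-author facet embeddings for tasks and methods are assumed inputs): surface authors who work on similar tasks to a focal author but use very different methods.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def suggest_bridges(focal, authors, k=5):
    """focal: {"name": ..., "tasks": vector, "methods": vector} (assumed).
    authors: dict name -> {"tasks": vector, "methods": vector} (assumed)."""
    scored = []
    for name, facets in authors.items():
        if name == focal["name"]:
            continue
        task_sim = cosine(focal["tasks"], facets["tasks"])
        method_sim = cosine(focal["methods"], facets["methods"])
        # High task similarity but low method similarity -> a potential bridge.
        scored.append((task_sim * (1.0 - method_sim), name))
    return sorted(scored, reverse=True)[:k]
```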
GrantExplorer
GrantExplorer is a free, open-source tool for exploring the phrases that appear in grants funded by U.S. federal agencies. It covers more than half a million grants from the National Science Foundation (NSF), the National Institutes of Health (NIH), and the Department of Defense (DoD). The tool uses React, D3, and FastAPI for interactive visualizations, and Elasticsearch and Gensim language models to intelligently assist with keyword queries. The source code is available on GitHub.
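As a hedged sketch of how the query assistance might fit together (the index name, field names, and model file below are hypothetical, and the Elasticsearch 8.x Python client is assumed): expand a keyword with its nearest neighbors in a Gensim word-vector model, then run the expanded query against Elasticsearch.

```python
from elasticsearch import Elasticsearch  # assumes the 8.x Python client
from gensim.models import KeyedVectors

es = Elasticsearch("http://localhost:9200")
vectors = KeyedVectors.load("grant_term_vectors.kv")  # hypothetical model file

def search_grants(term, n_expansions=5, n_hits=10):
    # Expand the user's keyword with its nearest neighbors in the vector space.
    expansions = [w for w, _ in vectors.most_similar(term, topn=n_expansions)]
    query = {"match": {"abstract": " ".join([term, *expansions])}}
    # The "grants" index and "abstract"/"title" fields are hypothetical names.
    resp = es.search(index="grants", query=query, size=n_hits)
    return [hit["_source"]["title"] for hit in resp["hits"]["hits"]]
```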
Community detection on large networks
Community detection, or clustering, algorithms can reveal patterns and relationships in complex citation networks. Many algorithms exist for detecting communities in networks, representing several different approaches to the problem. These algorithms are often computationally demanding, and with the continually increasing number of publications, the challenge is to adapt them to very large networks. To address these issues, I have developed new methods to cluster very large citation networks. Using several parallel processing techniques, I am able to cluster networks with hundreds of millions of publications and more than a billion citation links. The code is available on GitHub.
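A simplified sketch of one such parallelization strategy, assuming the network arrives as an edge list of paper-id pairs and that each connected component fits comfortably in a single worker (the real pipeline involves more than this):

```python
from multiprocessing import Pool

import igraph as ig
import leidenalg

def cluster_component(edges):
    """Run Leiden community detection on one component's edge list."""
    g = ig.Graph.TupleList(edges, directed=False)
    part = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)
    return {v["name"]: cid for v, cid in zip(g.vs, part.membership)}

def cluster_citation_network(edge_list, n_workers=8):
    # Split the network into connected components, cluster the components in
    # parallel worker processes, and merge the cluster assignments.
    g = ig.Graph.TupleList(edge_list, directed=False)
    component_edges = [
        [(sub.vs[e.source]["name"], sub.vs[e.target]["name"]) for e in sub.es]
        for sub in g.components().subgraphs()
        if sub.ecount() > 0
    ]
    labels = {}
    with Pool(n_workers) as pool:
        for offset, assignment in enumerate(pool.map(cluster_component, component_edges)):
            for node, cid in assignment.items():
                # Prefix with the component index so cluster ids stay unique.
                labels[node] = (offset, cid)
    return labels
```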
Visualizing scholarly influence
The scholarly literature is a vast store of formalized human knowledge, interconnected by citations between publications. Following these citations is one way to measure the influence of scholarly research. Scholar.eigenfactor.org is a collection of tools I built that use citations to measure and visualize the influence of collections of papers. These collections can represent, for example, individual scholars, journals, academic departments, or fields of study.
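As a toy illustration of the underlying idea (not the Eigenfactor algorithm itself): score papers with PageRank over the citation graph, then aggregate the scores of a collection of interest.

```python
import networkx as nx

def collection_influence(citation_edges, collection):
    """citation_edges: iterable of (citing_id, cited_id) pairs.
    collection: set of paper ids (e.g., one scholar's papers)."""
    g = nx.DiGraph(citation_edges)
    scores = nx.pagerank(g, alpha=0.85)  # a PageRank-style influence score
    return sum(scores.get(pid, 0.0) for pid in collection)
```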
Mapping Misinformation Research
I applied tools for visualizing and analyzing the research publications in the emerging fields of Misinformation and Science Communication. Starting with sets of seed papers representing some key recent research in these fields, I provided tools to examine author relationships, influence from outside fields, and related research identified by machine learning. These tools are collected at misinformationresearch.org, as part of a report for the National Academy of Sciences.
Mathematical Jargon
We analyze the mathematical language used in hundreds of thousands of scientific papers, comparing the use of math across different disciplines. By comparing the distributions of mathematical symbols and terms across fields, we quantify the "jargon barriers" between them: the difficulty any two fields might have communicating, based on how different their use of mathematical language is. We find that characterizing fields by their use of mathematical language groups them in intuitive ways, and we explore how this approach could be used for recommendations in the scholarly literature and for helping to bridge knowledge gaps in science.
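A minimal sketch of how such a barrier might be quantified (illustrative only; the toy symbol counts and the choice of Jensen-Shannon distance are assumptions, not necessarily the measure we used):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jargon_barrier(field_a_counts, field_b_counts):
    """field_*_counts: dict mapping a math symbol or term to its count in that field."""
    vocab = sorted(set(field_a_counts) | set(field_b_counts))
    p = np.array([field_a_counts.get(s, 0) for s in vocab], dtype=float)
    q = np.array([field_b_counts.get(s, 0) for s in vocab], dtype=float)
    # jensenshannon normalizes the count vectors into probability distributions.
    return float(jensenshannon(p, q))

# Fields that rely on very different symbols get a larger distance (toy data).
physics = {r"\nabla": 40, r"\partial": 55, r"\hbar": 20}
economics = {r"\beta": 50, r"\epsilon": 30, r"\partial": 15}
print(jargon_barrier(physics, economics))
```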
Predictors of permanent housing for homeless families
As a Data Science for Social Good summer fellow at the University of Washington's eScience Institute, I worked on a team collaborating with the Bill and Melinda Gates Foundation and other organizations to help understand and address the problem of family homelessness in western Washington state. Our contributions included analyzing data, building models, and creating interactive visualizations.
Commitments in written communication
As an intern at Microsoft Research, I worked on a project to help Microsoft's personal digital assistant, Cortana, identify when people make commitments in their outgoing emails, and understand what kinds of commitments they make.