Data Science


Skills

Programming: Python, SQL
Data Science: pandas, scikit-learn
Data Visualization: Matplotlib, seaborn
Web Development: Flask
Version Control: Git, GitHub
Virtual Containers: Docker
Web Services: Google Cloud Run
Mathematical Computing: Mathematica
Mathematical Typesetting: LaTeX

My resume

Paper Recommender

I have created a recommendation engine for finding mathematical papers on arXiv.org.

One of the challenges of mathematics research is finding the relevant literature on a topic of interest. While it tends to be easy to find textbooks on the general area you are studying, it can be much more difficult to find the papers on a given topic. Google searches and word of mouth can get you reasonably far, but they require either knowing the right search terms to game the algorithm or knowing people familiar with the literature. This means you are limited by how well you can guess search terms and by the 'standard papers' known to your specific community, and it is easy to overlook other authors who are writing on closely related things. MathSciNet helps with some of this, but it has similar limitations, requires an active academic institutional affiliation, and misses preprints that have yet to be published.

To help with this, I have written a recommendation engine where you can enter the title and abstract of a paper you are interested in (or have written yourself) and get recommendations of papers on similar topics. Alternatively, you can guess at the title and abstract of the kind of paper you would like to find, and get a selection of papers on nearby topics. In my experience, this tool recommends a mix of papers I had found via traditional means and interesting papers I otherwise would have missed.

I have compiled the metadata of all of arXiv's math papers into a dataframe and generated two tf-idf vectorizers, one based on the titles of the papers and one on their abstracts. Using these, I compare the user-provided sample title and abstract to all of those in the dataframe to generate a list of the ten most similar papers. Thank you to arXiv for use of its open access interoperability.
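As a rough illustration of the idea, here is a minimal sketch using pandas and scikit-learn. It is not the code behind the tool: the toy rows stand in for the arXiv metadata, and the use of cosine similarity, the equal weighting of title and abstract scores, and the names `recommend` and `title_weight` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the arXiv math metadata dataframe (hypothetical rows).
papers = pd.DataFrame({
    "title": [
        "Knot invariants from quantum groups",
        "Heat kernel estimates on Riemannian manifolds",
        "Prime gaps and sieve methods",
        "Jones polynomial of alternating links",
    ],
    "abstract": [
        "We construct invariants of knots and links using representations of quantum groups.",
        "We prove Gaussian bounds for the heat kernel on complete manifolds with Ricci curvature bounded below.",
        "We use sieve methods to bound gaps between consecutive primes.",
        "We study the Jones polynomial invariant for alternating knots and links.",
    ],
})

# One tf-idf vectorizer for titles, one for abstracts.
title_vec = TfidfVectorizer(stop_words="english")
abstract_vec = TfidfVectorizer(stop_words="english")
title_matrix = title_vec.fit_transform(papers["title"])
abstract_matrix = abstract_vec.fit_transform(papers["abstract"])


def recommend(title, abstract, top_n=10, title_weight=0.5):
    """Return the top_n papers most similar to the sample title and abstract.

    Assumption: similarity is a weighted average of the cosine similarities
    of the title vectors and of the abstract vectors.
    """
    title_sim = cosine_similarity(title_vec.transform([title]), title_matrix).ravel()
    abstract_sim = cosine_similarity(abstract_vec.transform([abstract]), abstract_matrix).ravel()
    score = title_weight * title_sim + (1 - title_weight) * abstract_sim
    top = np.argsort(score)[::-1][:top_n]  # indices of the highest scores first
    return papers.iloc[top].assign(score=score[top])


result = recommend("Polynomial invariants of knots",
                   "New invariants of knots and links.", top_n=2)
print(result[["title", "score"]])
```

Keeping the two vectorizers separate means a short title does not get drowned out by a long abstract, and the weight between the two can be tuned independently.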

Paper Recommender

Subject Identifier

I have created a classifier for identifying the subject of a mathematical paper.

I have compiled the metadata of all of arXiv's math papers into a dataframe and generated two tf-idf vectorizers, one based on the titles of the papers and one on their abstracts. Using these, I compare the user-provided sample title and abstract for a paper to all of those in the dataframe to predict the appropriate subject classifications for that paper. If no subjects are predicted, a list of the three most likely possibilities is given instead. Thank you to arXiv for use of its open access interoperability.
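The prediction step, including the fallback to the three most likely subjects, could look something like the sketch below. The description above does not say which model is used, so the one-vs-rest logistic regression, the 0.5 probability threshold, the toy training rows, and the name `identify_subjects` are all assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the arXiv math metadata (hypothetical rows and labels).
titles = [
    "Knot invariants from quantum groups",
    "Heat kernel estimates on Riemannian manifolds",
    "Prime gaps and sieve methods",
    "Jones polynomial of alternating links",
]
abstracts = [
    "We construct invariants of knots and links using representations of quantum groups.",
    "We prove Gaussian bounds for the heat kernel on complete manifolds.",
    "We use sieve methods to bound gaps between consecutive primes.",
    "We study the Jones polynomial invariant for alternating knots and links.",
]
subjects = [["math.GT", "math.QA"], ["math.DG", "math.AP"], ["math.NT"], ["math.GT"]]

# Two tf-idf vectorizers, with their outputs stacked side by side as features.
title_vec = TfidfVectorizer(stop_words="english")
abstract_vec = TfidfVectorizer(stop_words="english")
features = hstack([title_vec.fit_transform(titles),
                   abstract_vec.fit_transform(abstracts)]).tocsr()

# A paper can carry several subjects, so encode the labels as a 0/1 matrix.
binarizer = MultiLabelBinarizer()
labels = binarizer.fit_transform(subjects)

# Assumed model: one independent logistic regression per subject.
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model.fit(features, labels)


def identify_subjects(title, abstract, threshold=0.5, fallback_n=3):
    """Return predicted subjects, or the fallback_n most likely if none pass."""
    x = hstack([title_vec.transform([title]),
                abstract_vec.transform([abstract])]).tocsr()
    proba = model.predict_proba(x)[0]
    chosen = [str(c) for c, p in zip(binarizer.classes_, proba) if p >= threshold]
    if chosen:
        return chosen
    # No subject cleared the threshold: fall back to the most likely ones.
    top = np.argsort(proba)[::-1][:fallback_n]
    return [str(binarizer.classes_[i]) for i in top]


print(identify_subjects("Knot invariants", "Invariants of knots and links."))
```

With so few training rows the probabilities are not meaningful; the point is only the shape of the logic, where a thresholded multi-label prediction falls back to a ranked top three.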

Subject Identifier

ArXiv data blog

I have started a blog post about my process of working through the arXiv metadata.

Blog post