Data Science

Skills

  • Programming: Python, SQL
  • Data Science: pandas, scikit-learn
  • Data Visualization: Matplotlib, seaborn
  • Web Development: Flask
  • Version Control: Git, GitHub
  • Virtual Containers: Docker
  • Web Services: Google Cloud Run
  • Mathematical Computing: Mathematica
  • Mathematical Typesetting: LaTeX
My resume

Paper Recommender

I have created a recommendation engine for finding mathematical papers on arXiv.org.

One of the challenges of mathematics research is finding the relevant literature on a topic of interest. While it tends to be easy to find textbooks on the general area you are studying, it can be much more difficult to find the papers on a given topic. Google searches and word of mouth can get you reasonably far, but they require either knowing the right search terms to game the algorithm or knowing people familiar with the literature. This means you are limited by how well you can guess search terms or by the 'standard papers' known to your specific community, and it is easy to overlook other authors who may be writing on closely related things. MathSciNet helps somewhat, but it has similar limitations, requires an active academic institutional affiliation, and misses preprints that have yet to be published.

To help with this, I have written a recommendation engine where you can enter the title and abstract of a paper you are interested in (or have written yourself) and get recommendations of papers written on similar topics. Alternatively, you can guess at the title and abstract of a paper on a topic you would like to explore, and get a selection of papers on nearby topics. In my experience, this tool recommends a mix of papers I had found via traditional means and interesting papers I otherwise would have missed.

I have compiled the metadata for all of arXiv's math papers into a dataframe and generated two tf-idf vectorizers based on the titles and abstracts of the papers, respectively. Using these, I compare the user-provided sample title and abstract to all of those in the dataframe to generate a list of the ten most similar papers. Thank you to arXiv for use of its open access interoperability.
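
As a rough illustration of the idea, here is a minimal sketch of a tf-idf similarity ranking written with only the Python standard library. It is not the code behind the tool: the hand-rolled tf-idf stands in for a library vectorizer, the three sample papers are made up, and the way title and abstract similarities are combined (the hypothetical title_weight parameter) is an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(documents):
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Return an L2-normalised tf-idf vector as a {term: weight} dict."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def recommend(query_title, query_abstract, papers, top_n=10, title_weight=0.5):
    """Rank papers by a weighted mix of title and abstract similarity.

    title_weight is an assumed way of combining the two scores.
    """
    title_idf = fit_idf([p["title"] for p in papers])
    abstract_idf = fit_idf([p["abstract"] for p in papers])
    q_title = tfidf_vector(query_title, title_idf)
    q_abstract = tfidf_vector(query_abstract, abstract_idf)
    scored = []
    for p in papers:
        t_sim = cosine(q_title, tfidf_vector(p["title"], title_idf))
        a_sim = cosine(q_abstract, tfidf_vector(p["abstract"], abstract_idf))
        scored.append((title_weight * t_sim + (1 - title_weight) * a_sim, p["title"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Made-up sample records, standing in for rows of the metadata dataframe.
papers = [
    {"title": "Rational points on elliptic curves",
     "abstract": "We study the rank of elliptic curves over number fields."},
    {"title": "Spectral gaps of random graphs",
     "abstract": "We bound the eigenvalues of adjacency matrices of random regular graphs."},
    {"title": "Knot invariants from quantum groups",
     "abstract": "We construct polynomial invariants of knots using representations of quantum groups."},
]

results = recommend(
    "Ranks of elliptic curves",
    "Bounds on the rank of elliptic curves over the rationals.",
    papers,
)
for score, title in results:
    print(f"{score:.3f}  {title}")
```

With a real corpus, the idf tables and paper vectors would be computed once and stored rather than rebuilt for each query.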

Paper Recommender

Subject Identifier

I have created a classifier for identifying the subject of a mathematical paper.

I have compiled the metadata for all of arXiv's math papers into a dataframe and generated two tf-idf vectorizers based on the titles and abstracts of the papers, respectively. Using these, I compare the user-provided sample title and abstract for a paper to all of those in the dataframe to predict the appropriate subject classifications for that paper. If no subjects are predicted, a list of the three most likely possibilities is given instead. Thank you to arXiv for use of its open access interoperability.
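
The fallback step can be sketched as follows, assuming the classifier produces a score between 0 and 1 for each subject. The function name pick_subjects, the 0.5 threshold, and the sample scores are all hypothetical and are not taken from the tool itself.

```python
def pick_subjects(scores, threshold=0.5, fallback_n=3):
    """Return subjects scoring at or above the threshold.

    If none clears the threshold, return the fallback_n highest-scoring
    subjects instead. The threshold value is an assumption.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    predicted = [subject for subject, score in ranked if score >= threshold]
    if predicted:
        return predicted
    return [subject for subject, _ in ranked[:fallback_n]]


# Made-up scores keyed by arXiv subject codes.
confident = {"math.NT": 0.81, "math.AG": 0.62, "math.CO": 0.05}
uncertain = {"math.NT": 0.31, "math.AG": 0.28, "math.CO": 0.12, "math.PR": 0.04}

print(pick_subjects(confident))
print(pick_subjects(uncertain))
```

The first call returns every subject that clears the threshold, and the second falls back to the three highest-scoring subjects because none does.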

Subject Identifier

arXiv data blog

I have started a blog post about my process working through the arXiv metadata.

Blog post