[Research Development] Analyzing Past Awards using R

A Beginner’s Guide with Minimal Code-Tweaking

R
dataviz
text analysis
NSF
grants
research development
Author
Affiliation

JMU REDI Office of Research Development

Published

Monday, May 12, 2025

Introduction

At James Madison University’s Office of Research Development, one of my responsibilities is analyzing past award data for grants our faculty are interested in. This matters for faculty members who want to apply for grants: they need to know what has been funded in the past, especially recently, within the specific track or program they are targeting, and how much funding was awarded.

I was asked to put together a 1-hour training for other research development professionals in our office on how to analyze past awards data with little to no knowledge of R.

I prepared this document and supporting files for that purpose. I thought it would be a good idea to share this with the wider community in the spirit of open science. Having participated in NORDP’s annual conference, I wanted to give back to the community by sharing these training materials.

But I don’t know R!

How can we analyze and present past award data with no to minimal knowledge of R?

That can be challenging, especially when you encounter problems. But it is not impossible. I will provide a beginner-friendly introduction to analyzing past awards in R.

You can rely mostly on the provided code, and I will explain what you need to modify. When you get stuck, GitHub Copilot, which is free for those with university affiliations, or other generative AI tools can help. However, you will need to explain your error clearly and step by step to get the best results.

Installing R and RStudio

First, install R and RStudio. R is the programming language, while RStudio is an integrated development environment (IDE) for R. You can download R from here, and RStudio from here.

You can refer to the following video for a step-by-step guide on how to install R and RStudio.

Note

This is not a guide to teach you R. If you are interested in learning R, I recommend the following resources:

R for Data Science

Hands-on Data Visualization

Data Visualization - Andrew Heiss

Course: Harvard on Edx - Data Science: R Basics

JMU Course for JMU staff/students: Introduction to R

Hands-on-practice: Datacamp

For more free R books, see Dr. Mine Dogucu’s website.

Downloading the Repository

Important

If you don’t have Git installed, you can download the repository as a zip file by clicking on the green “Code” button and selecting “Download ZIP.” Then, extract the zip file to a folder on your computer.

Then, open RStudio, and click on the “File” menu, then “New Project,” and select “Existing Directory.” Select the folder where you extracted the zip file.

Alternatively, you can install git by following the instructions here.

After installing R and RStudio, open RStudio, and click on the “File” menu, then “New Project,” “Version Control,” and then “Git.”

Under the “Repository URL” box, paste the URL of the repository: https://github.com/kjayhan/past_grants.git. Name the project directory, select a folder to save it in, and click ‘Create Project’.

This repository contains the Quarto document, the R code, and sample awards data files that you need to analyze past awards. There is also a bonus jmu.scss file that you can use to customize the look of your Quarto document. You can also use this theme I created for JMU branding if you want.

Open the “awards_template.qmd” file in RStudio. This is the template you will use to analyze past awards data. Save it as “NSF X awards.qmd” or whatever you want to name it within the same directory you created by downloading my repository.

Important

When analyzing your past award data, ensure you place it in the data folder as a “.csv” file within the same directory you created by downloading my repository. For NSF past awards, click “Browse Projects Funded by This Program” under any NSF program, then click “CSV” to download the data.

Installing Packages

After downloading the repository, you need to install the packages we will be using. You only need to do this once, similar to installing an app on your phone. You can do this by running the following code in the R console. If a line of R code begins with “#”, it is a comment and does not run. If you uncomment it by removing the “#”, it will run. Uncomment the lines in the following code chunk to install the packages.

Show the code
# install.packages("tidyverse")
# install.packages("tidytext")
# install.packages("stopwords")
# install.packages("topicmodels")
# install.packages("tm")
# install.packages("ggraph")
# install.packages("igraph")
# install.packages("janitor")
# install.packages("ggeasy")
# install.packages("reshape2")

Then we load the packages we will be using. This is similar to opening an app on your phone to be able to use it. You can do this by running the following code in the R console.

Show the code
library(tidyverse) # Data manipulation and visualization
library(janitor) # Data cleaning tools
library(tidytext) # Text mining and processing
library(stopwords) # Access to stopword lists
library(topicmodels) # For topic modeling
library(tm) # Text mining utilities
library(ggraph) # For creating graph-based visualizations
library(igraph) # Graph analysis
library(ggeasy) # Easy labeling of ggplot2 graphs

Loading the Data

If different, replace ‘awards.csv’ with the name of your file.

The sample data I use is from NSF Division of Social and Economic Sciences (SBE/SES).

Show the code
awards <- read_csv("data/awards.csv", locale = locale(encoding = "Latin1")) # Read the csv file and assign it to the variable 'awards'

Basic Data Cleaning

Important

If the column names for the amount or date are different, replace all instances of “awarded_amount_to_date” with the correct column name for amount, and the same for “last_amendment_date.” You can do that by using the find and replace feature by clicking CTRL+F or Command+F.
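Alternatively, if you are comfortable making one small code change, you can rename the columns at the top of the script instead of using find and replace. The names “award_amount” and “award_date” below are hypothetical placeholders; substitute whatever column names your csv file actually uses.

```r
library(dplyr) # loaded automatically with the tidyverse

# Hypothetical example: map your file's column names (right-hand side)
# onto the names the template expects (left-hand side).
awards <- awards |>
  rename(awarded_amount_to_date = award_amount,
         last_amendment_date = award_date)
```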

We will begin with cleaning the column names and the data to make them more readable and usable.

Show the code
awards <- janitor::clean_names(awards) # Clean the column names to make them more readable

For now, I will focus on projects with titles starting with ‘Collaborative Research’ to demonstrate data filtering.

If you want to filter based on title, you can change the title to whatever you want. You can also use regex to filter based on a pattern. For example, to match titles that contain either “Collaborative” or “Research,” you can use str_detect(title, "Collaborative|Research") instead. For more on regex, see this page.
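As a quick illustration (a toy sketch you can run in the console, not part of the template), str_detect() returns TRUE for every element that matches the pattern:

```r
library(stringr) # loaded automatically with the tidyverse

titles <- c("Collaborative Research: Social Networks",
            "Research Experiences for Undergraduates",
            "Workshop on Open Data")

str_detect(titles, "Collaborative|Research") # TRUE TRUE FALSE
```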

Important

Remove or comment out the following lines if you don’t want to filter anything based on title.

Show the code
awards <- awards |>
  filter(str_detect(title, "Collaborative Research"))

Now, we make sure that the amount is numeric and the date is in the correct format.

Important

If the amount is already numeric, comment out the following line by adding “#” at the beginning of the line.
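Not sure whether the column is already numeric? A quick optional check you can run in the console:

```r
is.numeric(awards$awarded_amount_to_date) # TRUE means you can skip the parsing step
```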

Show the code
awards$awarded_amount_to_date <- parse_number(awards$awarded_amount_to_date)

The date format is set to "%m/%d/%Y" (e.g., 05/15/2024) in the code below. If the date format in your csv file is different, you can change it accordingly. See this page for more information on date formats.

Show the code
awards$last_amendment_date <- parse_date(awards$last_amendment_date, format = "%m/%d/%Y")
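The format string just has to mirror how the dates are written in your file. A few hypothetical examples you can try in the console (these values are illustrations, not from the data):

```r
library(readr) # loaded automatically with the tidyverse

parse_date("05/15/2024", format = "%m/%d/%Y") # month/day/4-digit year
parse_date("2024-05-15", format = "%Y-%m-%d") # ISO style
parse_date("15 May 2024", format = "%d %B %Y") # day, full month name
```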

You can filter for the years you are interested in. Replace 2016 with the year you are interested in. Greater than 2016 (i.e. > 2016) will give you the years after 2016, not including 2016. For more on data transformation, see this page.

Show the code
awards <- awards |>
  filter(year(last_amendment_date) > 2016)

We will now remove HTML tags from the abstract column to keep only the text. I will also remove the following sentence, because it would skew the text analysis, as it appears at the end of every NSF-funded abstract: “This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.”

Show the code
awards$abstract <- gsub("</>|<br/>|\r\r|\r \r|This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.", " ", awards$abstract) # Remove HTML tags and the sentence

Scripting in Quarto

You can learn more about Quarto here.

You can write in Quarto using RStudio. You can write in plain text, and it can be rendered into HTML or PDF. Headings begin with “#”, and subheadings with “##”, “###” depending on what level of subheading you want.

Introduction

You can write your introduction here. This section is not code. You can write your introduction in plain text.

You can add R code chunks by clicking on the green “+C” button in the top right corner of RStudio, and then selecting “R.” You can also use the keyboard shortcut CTRL+ALT+I or Command+Option+I to insert a code chunk.
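Putting those pieces together, here is a minimal sketch of what a section of your .qmd file might look like (the heading text and numbers are just placeholders):

````markdown
## Awarded Amounts

Plain text like this is rendered as regular prose.

```{r}
# R code lives inside chunks like this one
mean(c(100, 200, 300))
```
````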

Awarded Amounts

In this section, I analyze the awarded amounts.

Some awards are collaborative, meaning the same project (with the same title and abstract) may have multiple entries because the funding is divided across institutions. The following code aggregates the amount for collaborative projects.

Show the code
awards_g <- awards |>
  group_by(title) |>
  summarize(n = n(),
            amount = sum(awarded_amount_to_date, na.rm = T)) |>
  ungroup()
Important

I give you two options for a summary of the awarded amounts in the template. Make sure to remove one of them.

Awarded Amounts Option 1

If the awarded amount range is “normal,” you can use the following code. Otherwise, delete the following code chunk. If you want to make changes to the code (color, font size, axis breaks etc.), your best bet is getting help from the internet or a generative AI tool. If you want to learn more about ggplot2, you can check out this page.

This code creates a density plot of the awarded amounts. You can change the color of the lines by changing the hex color codes. You can also change the x and y axis labels by changing the text in the labs function.

Show the code
density <- awards_g |>
  ggplot(aes(round(amount/1000))) + #dividing by 1000
  geom_density() +
  geom_vline(xintercept = median(round(awards_g$amount/1000)), color = "#450084") +
  geom_vline(xintercept = mean(round(awards_g$amount/1000)), color = "#450084") +
  geom_text(aes(x = median(round(amount/1000)), y = Inf, label = "Median"), color = "#450084", vjust=1,hjust=2.1,size=5,angle=90) +
  geom_text(aes(x = mean(round(amount/1000)), y = Inf, label = "Mean"), color = "#450084", vjust=1,hjust=2.1,size=5,angle=90) +
  scale_x_continuous(breaks = c(seq(0, max(round(awards_g$amount/1000)), by = 1000))) +
  #coord_flip() +
  labs(x = "Awarded Amounts (in $1000)",
       y = "Density") +
  theme_void() +
  theme(legend.position="none",
        text = element_text(color = '#A4232B'),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x = element_text(colour = "#A4232B"),
  panel.background = element_rect(fill = "#F4EFE1"))

density

Show the code
# Save the plot to a file. You can change the file name and the path to the file.

ggsave("figures/density_normal.png", density, dpi = 300)

Awarded Amounts Option 2

If the awarded amounts in your data are skewed, as in this example, you can use the following code, which takes the log of the awarded amounts. In that case, make sure to delete Option 1.

Show the code
density_log <- awards_g |>
  ggplot(aes(round(amount/1000))) +
  geom_density() +
  scale_x_continuous(trans='log2', 
                     breaks = c(2^seq(0, ceiling(log2(max(round(awards_g$amount/1000))))), round(mean(round(awards_g$amount/1000)),0), round(median(round(awards_g$amount/1000)),0))) +
  geom_vline(xintercept = median(round(awards_g$amount/1000)), color = "#450084") +
  geom_vline(xintercept = mean(round(awards_g$amount/1000)), color = "#450084") +
  geom_text(aes(x = median(round(amount/1000)), y = Inf, label = "Median"), color = "#450084", vjust=1,hjust=2.1,size=5,angle=90) +
  geom_text(aes(x = mean(round(amount/1000)), y = Inf, label = "Mean"), color = "#450084", vjust=1,hjust=2.1,size=5,angle=90) +
  labs(x = "Awarded Amounts in $1000 (log transformed)",
       y = "Density") +
  theme_void() +
  theme(legend.position="none",
        text = element_text(color = '#A4232B'),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x = element_text(colour = "#A4232B"),
  panel.background = element_rect(fill = "#F4EFE1"))


density_log

Show the code
# Save the plot to a file. You can change the file name and the path to the file.

ggsave("figures/density_log.png", density_log, dpi = 300)

This paragraph mixes regular text and inline code. Feel free to change the regular text. The part that begins with “`r”, is followed by R code, and ends with “`” is called inline code: when the document is rendered, the code runs and is replaced by its result. For example, the sentence below uses inline code to insert the mean and median of the awarded amounts.
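In the .qmd source, the sentence below might be written like this (a sketch assuming the awards_g data frame created earlier; the rendered document shows the computed numbers instead of the code):

```markdown
The average amount is $`r round(mean(awards_g$amount), 1)`.
The median amount is $`r median(awards_g$amount)`.
```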

The average of filtered NSF Division of Social and Economic Sciences (SBE/SES) projects is $438,731.6. The median amount is $366,072.

Abstract Analysis

Word Frequency in Abstracts

Here we analyze the abstracts of the awarded projects. We first remove duplicate titles/abstracts.

Show the code
# Remove duplicate titles
awards_abstracts <- awards |> distinct(title, .keep_all = TRUE)
Important

We will remove common stopwords (e.g., this, that). We will also remove custom stopwords. You can add or remove custom stopwords within quotation marks separated by commas as you see fit.

Show the code
custom_stopwords <- c("will", "can", "br", "s", "g", "e", "(anr).", "project", "research", "study", "results", "across")

We will remove extra whitespace, convert to lowercase, remove punctuation, remove numbers, remove custom stopwords, remove default stopwords, and remove extra whitespace again.

Show the code
awards_abstracts$abstract <- awards_abstracts$abstract |> 
  str_squish() |>  # Remove extra whitespace
  tolower()         # Convert to lowercase
  

text_corpus <- Corpus(VectorSource(awards_abstracts$abstract)) |> 
  # tm_map(content_transformer(function(x) gsub("</>|<br/>", " ", x))) |> # Remove HTML tags
  tm_map(removePunctuation) |> # Remove punctuation
  tm_map(removeNumbers) |> # Remove numbers
    tm_map(removeWords, custom_stopwords) |> # Remove custom stopwords
  tm_map(removeWords, stopwords("english")) |> # Remove default stopwords
  tm_map(stripWhitespace) # Remove extra whitespace

tdm <- TermDocumentMatrix(text_corpus) # Create a term-document matrix
word_freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE) # Sort terms by frequency
tdm_df <- data.frame(word = names(word_freqs), freq = word_freqs) # Create a data frame for the word frequencies


set.seed(1234)

freq <- ggplot(tdm_df[1:20,], aes(x = reorder(word, freq), y = freq)) + # Plot top 20 most frequent words
  geom_bar(stat = "identity", fill = "#450084") +
  coord_flip() + # Flip axes for better readability
  labs(x = "Terms", y = "Count", title = "Most Common Words in Abstracts") +
  ggeasy::easy_center_title() + # Center the title in the plot
  theme_void() + # Remove background elements
  theme(axis.text.y = element_text(size = 12),
        axis.text.x = element_text(size = 12),
        axis.title = element_text(size = 12),
        text = element_text(color = '#A4232B'),
        axis.title.x = element_text(colour = "#A4232B"),
        axis.title.y = element_text(colour = "#A4232B"),
        panel.background = element_rect(fill = "#F4EFE1")) 

ggsave("figures/freq.png", freq, dpi = 300)

Topic Modeling in Abstracts

We will use Latent Dirichlet Allocation (LDA) for topic modeling. Here, all you need to do is decide how many topics you want to extract. I will use 4 topics in this example. You can change the number of topics by changing the value of “k” in the LDA function. Play around with it to see what works best for your data.

Show the code
dtm <- DocumentTermMatrix(text_corpus)

# Remove empty documents from the dtm
row_totals <- rowSums(as.matrix(dtm))  # Compute row-wise term sums
dtm <- dtm[row_totals > 0, ]  # Keep only non-empty rows


lda <- LDA(dtm, k = 4, control = list(seed = 1234))
 
topics <- tidy(lda, matrix = "beta")

 
top_terms <- topics |>
  group_by(topic) |>
  slice_max(beta, n = 10) |>
  ungroup() |>
  arrange(topic, -beta)


# Define your custom color palette
jmu_palette <- c("#450084", "#CBB677", "#5498B6", "#A4232B", "#5F791C")

# Apply the custom color palette to your plot
tm <- top_terms |>
  mutate(term = reorder_within(term, beta, topic)) |>
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() +
  scale_x_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  theme_void() +
  scale_fill_manual(values = jmu_palette) +  # Add this line
  theme(axis.text.y = element_text(size = 12),
        axis.text.x = element_text(size = 12),
        axis.title = element_text(size = 12),
        text = element_text(color = '#A4232B'),
        axis.title.x = element_text(colour = "#A4232B"),
        axis.title.y = element_text(colour = "#A4232B"),
        panel.background = element_rect(fill = "#F4EFE1"))

tm

Show the code
ggsave("figures/tm.png", tm, dpi = 300)
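If you are unsure which “k” to pick, one rough heuristic (a sketch reusing the dtm object from above, not part of the template) is to compare model perplexity across a few candidate values. Lower perplexity is generally better, but readable, distinct topics matter more than the score:

```r
library(topicmodels)

# Fit an LDA model for each candidate number of topics and print its perplexity.
for (k in c(2, 4, 6, 8)) {
  lda_k <- LDA(dtm, k = k, control = list(seed = 1234))
  cat("k =", k, "-> perplexity:", perplexity(lda_k), "\n")
}
```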

Word Networks (Bigrams) in Abstracts

We will create a word network from the abstracts using bigrams (pairs of adjacent words). You can switch to trigrams (three words) or longer n-grams by changing the value of “n” in the unnest_tokens function. I will use bigrams in this example and filter out the common stopwords and custom stopwords we previously defined.

Show the code
awards_abstracts <- awards_abstracts |>
  select(title, abstract) # Select only the title and abstract columns
  
bigrams <- awards_abstracts |> 
  unnest_tokens(bigram, abstract, token = "ngrams", n = 2) |> # Tokenize into bigrams
  separate(bigram, c("word1", "word2"), sep = " ") |> # Split bigrams into two words
  filter(!word1 %in% c(stopwords("english"), custom_stopwords),
         !word2 %in% c(stopwords("english"), custom_stopwords)) # Remove stopwords from bigrams

bigram_counts <- bigrams |> count(word1, word2, sort = TRUE) # Count bigram frequencies

We will create a graph from the bigrams. You can change the threshold for filtering the bigrams by changing the value of “n” in the filter function. I will use 20 in this example to include only the bigrams that appear 20 or more times. You can change it to whatever you want.

Show the code
bigram_graph <- bigram_counts |> filter(n >= 20) |> # Keep only bigrams that appear 20 or more times
  igraph::graph_from_data_frame() # Create a graph from the bigram data frame


set.seed(123)



bigram <- ggraph(bigram_graph, layout = "fr") + # Create a force-directed graph of bigrams
  geom_edge_link(color = "#450084") +
  geom_node_point(color = "#450084") +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1, repel = TRUE, color = "#450084") + # Add labels to nodes with repelling effect
  theme_void() + # Remove background elements
  theme(panel.background = element_rect(fill = "#F4EFE1"),
        text = element_text(color = '#A4232B'))


bigram

Show the code
ggsave("figures/bigram.png", bigram, dpi = 300)

Bigram Frequency in Abstracts

We will create a bar plot of the most common bigrams in the abstracts. I will plot the top 20 in this example; you can change the number of bigrams by changing the 1:20 index in the code below.

Show the code
bigram_freq <- ggplot(bigram_counts[1:20,], aes(x = reorder(paste0(word1, " ", word2), n), y = n)) + # Plot top 20 most frequent bigrams
  geom_bar(stat = "identity", fill = "#450084") +
  coord_flip() + # Flip axes for better readability
  labs(x = "Bigrams", y = "Count", title = "Most Common Bigrams in Abstracts") +
  theme_void() + # Remove background elements
  # make the font size bigger
  theme(axis.text.y = element_text(size = 10),
        axis.text.x = element_text(size = 10),
        axis.title = element_text(size = 10),
        text = element_text(color = '#A4232B'),
        axis.title.x = element_text(colour = "#A4232B"),
        axis.title.y = element_text(colour = "#A4232B")
        ) +
  #theme_void() + # Remove background elements
  theme(panel.background = element_rect(fill = "#F4EFE1")) +
  ggeasy::easy_center_title() # Center the title in the plot

bigram_freq

Show the code
ggsave("figures/bigram_freq.png", plot = bigram_freq, dpi = 300)

Abstract Takeaways

Write your analysis based on above text analysis here.

AI Summary of Awarded Proposals

You can export the abstracts to a csv file, and then use AI tools to summarize and categorize the abstracts.

Show the code
# exporting the abstract data to a csv file.

write_csv(awards_abstracts, "data/awards_abstracts.csv")

Open the created csv file (awards_abstracts.csv in the data folder) in Excel or another program, and save it as a .pdf file.

Then go to NotebookLM, and upload the pdf file. You can then use the AI to summarize and categorize the abstracts.

You can also use other AI tools for this task.

Rendering

After you finish editing the Quarto document, you can render it to an .html file (or .pdf, .docx, .pptx – but you need much more customization for that).

You can do this by clicking on the “Render” button in the top center of the RStudio window (blue right-facing arrow icon). You can also use the keyboard shortcut CTRL+Shift+K (or Command+Shift+K on Mac) to render the document.

Congratulations

Congratulations! You have successfully analyzed past award data using R and RStudio. You can now use this knowledge to analyze your own past award data. If you have any questions or need help, feel free to reach out to me (after you consult with your AI assistant(s) first!).