Comparing the Class of 2020 and 2021 with Machine Learning and Reddit

Introduction

The college application process is stressful. Any student knows that it will be overwhelming even in the best of cases. On top of that, the many students who graduated high school in the past two years had another stressor that no one could have predicted: COVID-19. From missing out on their last year of high school to having trouble (both financial and academic) applying to their top-choice college, these students have had it bad. I, as part of the class of 2020, can attest to this. The coronavirus pandemic caused a great deal of uncertainty and many unexpected roadblocks in my college application process.

Even just between the class of 2020 and the class of 2021, times have changed dramatically. Most schools have been able to partially reopen, and we are already seeing a return to ‘normalcy’ (Operational). I first noticed this difference when talking to students in the class of 2021, who seemed to be having both a more typical senior year and a more stressful one. At first glance, these two observations seem to contradict each other. I am not the only one to notice this, either. The Tampa Bay Times interviewed 10 high school seniors about their experiences. Some said positive things about the year:

“‘I go to school every day, and I try to appreciate it, because last year when I didn’t have it, I was bored out of my mind,’ said Gavyn Dorsey, a senior at Zephyrhills High. ‘I treat every day like it’s the best day.’” (Tampa)

Others stated the negatives:

“‘I’m not going to lie. It hurts,’ said Jenesis Montero, a senior in Blake High School’s fine arts magnet program, who saw so many of her performances and competitions canceled. ‘I didn’t think they would take everything away. We have nothing. We have literally nothing.’” (Tampa)

It should be noted, however, that these ideas aren’t mutually exclusive. Any given student could be upset about the state of their senior year of high school while still being optimistic and appreciating what they have. What interests me, however, is the difference between the class of 2020 and the class of 2021.

The rest of this article explores that difference by asking one question: “what are the differences in attitudes in the class of 2021 compared to the class of 2020?” To answer it, I took data from the popular subreddits r/ACT, r/SAT, and r/ApplyingToCollege and analyzed it with the machine learning technique word2vec.

Why Reddit?

I chose Reddit for my corpus for three reasons: it is organized into subreddits, there are dates attached to posts, and posts are long (so the students can elaborate on their experiences more). Reddit is also a popular social media platform for posting about college applications. For example, r/ApplyingToCollege describes itself as “the premier forum for college admissions” and has 341,000 active members. Therefore, it acts as a great corpus for seeing the thoughts of students applying to colleges. I also gathered data from r/ACT and r/SAT, as taking these exams is a vital part of the college application process. Although several schools got rid of their standardized testing requirements in light of the pandemic, these communities have still been very active.

For scraping Reddit, I used a Python API for getting information from Reddit. For the class of 2020, I considered all posts from the respective subreddits that fall between August 1st, 2019 and June 1st, 2020. For the class of 2021, the date range was just one year later: August 1st, 2020 to June 1st, 2021. This method does result in some spillover, such as ambitious freshmen and sophomores posting, as well as those who like to linger on the subreddit after they are done with the application process. However, because these spillovers are uncommon, the data is still representative of the respective classes.
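
The date windows above can be sketched in a few lines of Python. This is only an illustration, not the code used for this article: it assumes each post comes with a Unix timestamp (called created_utc here), it treats the end dates as inclusive, and the posts list is a hypothetical placeholder rather than real Reddit data.

```python
from datetime import datetime, timezone
from typing import Optional

# Date windows from the article. Treating June 1st as inclusive is an assumption.
WINDOWS = {
    "class_of_2020": (
        datetime(2019, 8, 1, tzinfo=timezone.utc),
        datetime(2020, 6, 1, 23, 59, 59, tzinfo=timezone.utc),
    ),
    "class_of_2021": (
        datetime(2020, 8, 1, tzinfo=timezone.utc),
        datetime(2021, 6, 1, 23, 59, 59, tzinfo=timezone.utc),
    ),
}


def assign_class(created_utc: float) -> Optional[str]:
    """Return the window a Unix timestamp falls in, or None if it is in neither."""
    posted = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    for label, (start, end) in WINDOWS.items():
        if start <= posted <= end:
            return label
    return None


# Hypothetical placeholder posts as (created_utc, text) pairs.
posts = [
    (datetime(2019, 10, 15, tzinfo=timezone.utc).timestamp(), "placeholder post A"),
    (datetime(2020, 7, 4, tzinfo=timezone.utc).timestamp(), "placeholder post B"),
    (datetime(2021, 1, 20, tzinfo=timezone.utc).timestamp(), "placeholder post C"),
]

# Build one list of post texts per graduating class.
corpora = {label: [] for label in WINDOWS}
for created_utc, text in posts:
    label = assign_class(created_utc)
    if label is not None:  # posts from June 2nd to July 31st are dropped
        corpora[label].append(text)
```

Posts made in the gap between the two windows (here, placeholder post B) end up in neither corpus.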

Python code used

What is Word2Vec?

Word2Vec is a technique commonly used in natural language processing that maps each word to a vector (which can have hundreds of dimensions). Essentially, it learns the meaning of a word from the contexts the word appears in and encodes that meaning as a vector. Why is this helpful? Well, because vectors are numerical objects, we can do math with them. By comparing two word vectors, we can measure the semantic relationship between the words. That is, we can see how similar words are by seeing how close they are in a vector space. For simplicity, you can think of it as a graph with points. The closer words are to each other on that graph, the closer their meanings are.
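
To make "closeness" concrete, here is a minimal sketch of cosine similarity, a common way to compare word vectors. The three-dimensional vectors are made-up toy values chosen for illustration (real word2vec vectors are learned from text and are much longer), and toy_vectors is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for this example; they are not output from a trained model.
toy_vectors = {
    "stress": [0.9, 0.1, 0.0],
    "anxiety": [0.8, 0.2, 0.1],
    "pizza": [0.0, 0.1, 0.9],
}

related = cosine_similarity(toy_vectors["stress"], toy_vectors["anxiety"])
unrelated = cosine_similarity(toy_vectors["stress"], toy_vectors["pizza"])
print(f"stress vs. anxiety: {related:.3f}")
print(f"stress vs. pizza:   {unrelated:.3f}")
```

With these toy values, the pair pointing in nearly the same direction scores much higher than the pair pointing in different directions.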

If we import the Reddit corpora into the word2vec package in the R programming language, we can visualize some of the data.

Here is the class of 2020:

Here is the class of 2021:

Keep in mind that the actual location of each word is arbitrary (which is why the numbers in the class of 2020 dataset are in a different location than the numbers of the class of 2021). It’s the relationships that are important.
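
A small sketch can show why absolute positions do not matter. It assumes similarity is measured by the angle between vectors (cosine similarity) and uses made-up two-dimensional points: rotating every point by the same angle moves them all to new locations, yet the similarity between them stays the same.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two 2D vectors."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(*a) * math.hypot(*b))


def rotate(v, angle):
    """Rotate a 2D vector counterclockwise by `angle` radians."""
    x, y = v
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )


# Two made-up points standing in for word vectors.
a, b = (1.0, 0.5), (0.2, 1.0)
angle = math.radians(70)

before = cosine_similarity(a, b)
after = cosine_similarity(rotate(a, angle), rotate(b, angle))

print(f"before rotation: {before:.6f}")
print(f"after rotation:  {after:.6f}")
```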

The Results

I tried three different queries in both of the corpora to compare: uncertainty, covid, and stress. These results were then used to make conclusions about the different attitudes of the class of 2020 and the class of 2021.
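
Conceptually, each query ranks every word in a corpus by how close its vector is to the query. The Python sketch below illustrates the idea under stated assumptions: the vectors are toy values, closest_to is a hypothetical helper written for this example, and averaging the vectors of several query words is one common way to combine them, not necessarily what the R code in this article does.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_to(vectors, query_words, n=20):
    """Return the n words most similar to the average of the query word vectors."""
    dims = len(next(iter(vectors.values())))
    # Average the query vectors dimension by dimension.
    query = [
        sum(vectors[word][i] for word in query_words) / len(query_words)
        for i in range(dims)
    ]
    # Score every other word; skipping the query words themselves is a choice made here.
    scored = [
        (word, cosine_similarity(vec, query))
        for word, vec in vectors.items()
        if word not in query_words
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy vectors invented for this example; they are not taken from the Reddit corpora.
toy_vectors = {
    "uncertain": [0.9, 0.2],
    "uncertainty": [0.8, 0.3],
    "unknown": [0.7, 0.4],
    "essay": [0.1, 0.9],
}

ranking = closest_to(toy_vectors, ["uncertain", "uncertainty"], n=2)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```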

Uncertainty

First, I wanted to see students’ thoughts regarding uncertainty. To do this, I found what words are most similar and related to ‘uncertain’ and ‘uncertainty’ by running the following R code:

This resulted in the following table:

(Tables shown here)

Already, you can notice how COVID-19 is present in 6 (bolded) of the top 20 similar words. For comparison, I also did this for the class of 2021 corpus. This yielded the following table:

(Tables shown here)

Surprisingly, no words relating to the coronavirus made their way into the top 20. This suggests that the uncertainty of COVID-19 affected the class of 2020 significantly more than the class of 2021. Although I expected a drop, I was surprised that not a single COVID-19-related word appeared.

COVID-19

The second query that I tried was searching for words relating to COVID-19. For this, I used the R code:

That code resulted in the following table:

(Tables shown here)

The first two things that stick out in this table are the numbers and the URLs. The numbers are (mostly) specific dates of canceled events such as SAT or ACT test dates, and the URLs, www.forbes.com and www.nytimes.com, are popular sites that track the coronavirus. The exception is ‘19’, which comes from the phrase “covid 19”. However, these numbers and URLs don’t appear in the class of 2021 version of the table, as shown below:

(Tables shown here)

The closest words to “covid” in the class of 2021’s corpus relate less to the virus itself and more to its effects, mainly the cancelations. From this, I infer that the class of 2021 views COVID-19 as a fact of life instead of a new phenomenon that is a cause of uncertainty.

Stress

The last query that I did was looking at the word ‘stress’. For this, the R command is:

The table for the class of 2020 is below.

(Tables shown here)

The class of 2021’s corpus yielded the following table:

(Tables shown here)

These results surprised me. I originally thought that the two graduating classes would have different words close to “stress”, because I assumed that the stress relating to a new, unknown coronavirus would look different from the more predictable stressors of the college application process. These tables, however, are strikingly similar. The only notable difference is that the class of 2020’s table has the word “uncertainty”, which is expected for a time when COVID-19 was new. This word, however, is only the 17th closest word, so it’s not that drastic of a difference.
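
One way to put a number on how similar two such tables are is the Jaccard overlap of their word lists: the number of shared words divided by the number of distinct words across both lists. The sketch below is only an illustration. The lists hold placeholder tokens (w1, w2, and so on), not the actual words from the tables, and jaccard_overlap is a hypothetical helper.

```python
def jaccard_overlap(words_a, words_b):
    """Shared words divided by all distinct words across both lists."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 1.0  # two empty lists are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Placeholder tokens standing in for the top words of each table (not real results).
top_2020 = ["w1", "w2", "w3", "w4"]
top_2021 = ["w1", "w2", "w3", "w5"]

overlap = jaccard_overlap(top_2020, top_2021)
only_2020 = sorted(set(top_2020) - set(top_2021))
only_2021 = sorted(set(top_2021) - set(top_2020))

print(f"overlap: {overlap:.2f}")
print("only in 2020 list:", only_2020)
print("only in 2021 list:", only_2021)
```

A value near 1 means the two lists are almost the same, and a value near 0 means they share almost nothing.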

Conclusion

The differences between the graduating classes weren’t as stark as I initially thought. Although there were some clear differences relating to COVID-19 and uncertainty, the overall stress of the students was described similarly. This finding suggests that the stress of the college application process is relatively stable. So although the class of 2020 and the class of 2021 had vastly different college application processes, their attitudes were relatively similar.

If I were to do further research on this topic, I would like to explore more specifics. For example, I would clean the data to remove all links to photos or to Reddit itself so I could compare the usage of other college-related web pages across the years. I would also focus on the differences between elite and non-elite institutions. How was the process of applying to elite (and non-elite) schools different than in previous years? Lastly, I would explore the different majors and career paths. Are more people attracted to medical programs since the COVID-19 pandemic hit?

Works Cited

“Operational Strategy for K-12 Schools through Phased Prevention.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 19 Mar. 2021, www.cdc.gov/coronavirus/2019-ncov/community/schools-childcare/operation-strategy.html.

Tampa Publishing Company. “‘Not Going to Lie. It Hurts.’ Class of 2021 Tries to Stay Positive.” Tampa Bay Times, 1 Feb. 2021, www.tampabay.com/news/education/2021/02/01/not-going-to-lie-it-hurts-class-of-2021-tries-to-stay-positive/.

Originally published at https://ericchapdelaine.com.
