ANLY 502 - Group 23

Project Introduction

Analysis of the Impact of the Russia-Ukraine Conflict on the Commodity Market Using Reddit Data

This year, financial markets have experienced additional turmoil because of the ongoing Russia-Ukraine conflict. Figures 1 and 2 show the effect of the conflict on the S&P 500 index and the wheat price. This project studies the event from a perspective that combines the commodity market with the online social media platform Reddit. Modern finance theory holds that asset prices are mainly affected by market and psychological factors. From the market perspective, global economies currently face severe challenges from food and energy shortages, since Ukraine and Russia are among the major exporters of commodities such as wheat and natural gas. From the behavioral finance perspective, geopolitical conflict can affect investor psychology, which in turn can affect commodity prices. This project explores the relationship between the food and energy commodity markets and Russia-Ukraine-related Reddit posts, in the hope that the data can capture both market and psychological factors. If another major event emerges in the future, a similar approach can be used to evaluate its impact on commodity markets.

Figures 1 and 2. Left: S&P 500 index (source: New York Times). Right: wheat price (source: Financial Times).

To analyze the Reddit data efficiently and quantitatively, we borrow techniques from the burgeoning field of Natural Language Processing (NLP), including topic modeling, sentiment analysis, and language detection. For the regression task of predicting returns, we experiment with a suite of machine learning (ML) algorithms, including Lasso/Ridge Regression, Principal Component Regression, Random Forest, Gradient Boosting, and Neural Networks.

Because Reddit is a large community whose hundreds of thousands of users generate a large volume of content, we use the big data processing tool Spark to handle the heavy data lifting. Our ultimate goal, predicting commodity market returns from Reddit data, breaks down into three sections. The EDA section introduces the commodity and Reddit datasets with their summary statistics. The NLP section demonstrates our feature engineering pipeline and our two feature categories: Reddit topics and Reddit sentiment. The Machine Learning section explains the prediction algorithms and their results. Lastly, the Conclusion section summarizes the project and discusses next steps.

Throughout the project, we answer the data science questions below as milestones toward that final goal.

Data Science Questions to Explore

Reddit Text Analysis

Q1. (NLP) What are the major topics discussed on Reddit related to the Russia-Ukraine conflict?

Business goal: Identify the topics related to the Russia-Ukraine conflict that are discussed on Reddit, and use them to guide the direction of the rest of the project.

Technical proposal: Use NLP to convert unstructured text data into a numerical format. First, use spaCy to remove numbers and punctuation and to lemmatize the text. The resulting document-term matrix is then used to fit a topic model over the large collection of documents and to provide features for classifying documents. Use Latent Dirichlet Allocation (LDA) to model the topics, and visualize the topic model using pyLDAvis.
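As a rough illustration of the "numerical format" step, the sketch below builds a small document-term matrix using only the Python standard library. The sample posts, the regex tokenizer, and the function names are assumptions made for illustration; the regex stands in for spaCy preprocessing and does no lemmatization or stop-word removal. A matrix of this shape is the kind of input an LDA implementation would take.

```python
import re
from collections import Counter

# Hypothetical sample posts; the real input would be Reddit submissions/comments.
posts = [
    "Wheat exports from Ukraine fell 30% in March!",
    "Gas prices rise as sanctions hit Russia.",
    "Ukraine wheat harvest and gas supply both at risk.",
]

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (drops numbers and punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix

vocab, dtm = document_term_matrix(posts)
```

Each row of `dtm` is one post and each column is one vocabulary word, so the matrix grows with both the corpus and the vocabulary; at Reddit scale this step would be done in Spark rather than in plain Python.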

Reddit Text Analysis

Q2. (NLP) Which languages other than English can be analyzed in the dataset?

Business goal: Reddit content can include languages other than English, especially when the topic is a geopolitical issue such as the Russia-Ukraine conflict. It will be useful to conduct text analysis in those languages as well.

Technical proposal: The first step is to identify which submissions and comments contain non-English text, using a language detector from John Snow Labs. We expect to find Ukrainian and Russian in the data. Next, this project will identify keywords and phrases in each language. Finally, it will be useful to run topic modeling on each language and compare the results.
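To make the detection step concrete, here is a deliberately simple heuristic that separates Ukrainian from Russian text by looking for letters that exist in only one of the two alphabets. It is not the John Snow Labs detector named above and is far less accurate than a trained model; the function name, the 0.5 threshold, and the return labels are all assumptions for illustration.

```python
# Letters present in the Ukrainian alphabet but not the Russian one.
UKRAINIAN_ONLY = set("\u0491\u0454\u0456\u0457")  # ghe with upturn, ye, i, yi
# Letters present in the Russian alphabet but not the Ukrainian one.
RUSSIAN_ONLY = set("\u0451\u044a\u044b\u044d")    # yo, hard sign, yeru, e

def guess_language(text):
    """Crude script-based guess: 'uk', 'ru', 'cyrillic-ambiguous', 'non-cyrillic' or 'unknown'."""
    lowered = text.lower()
    letters = sum(1 for ch in lowered if ch.isalpha())
    if letters == 0:
        return "unknown"
    cyrillic = sum(1 for ch in lowered if "\u0400" <= ch <= "\u04ff")
    if cyrillic / letters < 0.5:  # assumed threshold
        return "non-cyrillic"
    uk_hits = sum(1 for ch in lowered if ch in UKRAINIAN_ONLY)
    ru_hits = sum(1 for ch in lowered if ch in RUSSIAN_ONLY)
    if uk_hits > ru_hits:
        return "uk"
    if ru_hits > uk_hits:
        return "ru"
    return "cyrillic-ambiguous"
```

Short comments often contain none of the distinguishing letters and fall into the ambiguous bucket, which is one reason a trained detector is the better choice for the real pipeline.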

Reddit Text Analysis

Q3. Which topics/kinds of submissions/comments are more popular?

Business goal: Within each topic, which submissions/comments are the most popular? We want to find out which articles and posts people discuss most, and which ones make each topic prominent in the dataset.

Technical proposal: Based on the topic model from the previous step, we can map the topic vectors in the new coordinate space back to the original document-word vectors. From the resulting document-topic matrix, we can see which document is the most significant in its corresponding topic. As a potential next step, we may also apply methods such as word2vec/doc2vec to compute document vectors, project them to a lower-dimensional space (t-SNE/UMAP/PCA), and visualize them.
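One simple way to read "most significant document in its topic" is to take, for each topic, the document with the largest weight on that topic. The sketch below assumes a document-topic weight matrix is already available; the numbers and the function name are made up for illustration.

```python
# Hypothetical document-topic weights: one row per document, one column per topic.
doc_topic = [
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.60, 0.30, 0.10],
    [0.05, 0.25, 0.70],
]

def top_document_per_topic(weights):
    """Map each topic index to the index of the document with the highest weight on it."""
    n_docs = len(weights)
    n_topics = len(weights[0])
    return {
        topic: max(range(n_docs), key=lambda doc: weights[doc][topic])
        for topic in range(n_topics)
    }

representatives = top_document_per_topic(doc_topic)
```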

Sentiment Analysis

Q4. How does sentiment in Reddit posts about the Russia-Ukraine (RU) conflict change over time?

Business goal: Conduct sentiment analysis on the Reddit posts to identify the most significant sentiments in each period.

Technical proposal: Combine the content from submissions and comments in our dataset. Conduct sentiment analysis using Spark NLP. Categorize submission records by emotion (surprise, fear, sadness…). Derive time series that track the changes in sentiment. Conduct regression analysis with variables including author age group, language, etc.

Sentiment Analysis

Q5. Can foreign-language comments be used to predict comment score?

Business goal: Predict whether a Ukrainian comment receives a positive or negative score using a text classification algorithm.

Technical proposal: While the bag-of-words approach is useful for text classification in English, we will explore whether the same method works for foreign languages. Our first step is to create word tokens and then map each word to a non-ordinal integer index. We then apply the bag-of-words features with logistic regression and naïve Bayes.
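The bag-of-words plus naïve Bayes idea can be sketched without any library. The code below is a minimal multinomial naïve Bayes with add-one (Laplace) smoothing; the tiny training set uses English placeholder tokens and invented labels purely for illustration, and the function names are hypothetical. The same code works on tokens in any language once the text is tokenized.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of (tokens, label). Returns (class_counts, word_counts, vocab)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokens)  # bag-of-words counts per class
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, tokens):
    """Return the label with the highest log-posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled comments (already tokenized).
training = [
    (["good", "news", "today"], "positive"),
    (["great", "support", "good"], "positive"),
    (["bad", "news", "attack"], "negative"),
    (["terrible", "bad", "loss"], "negative"),
]
model = train_naive_bayes(training)
```

Calling `predict(model, ["good", "support"])` scores each class by its prior plus the smoothed log-likelihood of the tokens and returns the higher-scoring label.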

Sentiment Analysis

Q6. Is there a difference in sentiment among different age groups?

Business goal: Is there a significant difference in sentiment or topics toward the RU event between teenagers and adults? We want to measure how sentiment differs between teenagers and adults in response to individual smaller events within the RU conflict, which can be mapped onto a timeline. Different policies can then be designed for these two groups.

Technical proposal: We first group the data by date/month and over_18. Then we run sentiment analysis on each Reddit comment/submission and collect statistics of the sentiment scores in each group. Finally, we plot the two sentiment lines and map them to the smaller events.
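The grouping step can be sketched as follows, assuming each record already carries a sentiment score. The records, the score scale, and the function name are invented for illustration; only the `over_18` field name comes from the text above. On the full dataset the same aggregation would be a Spark group-by.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scored records; sentiment values are made up.
records = [
    {"date": "2022-02-24", "over_18": False, "sentiment": -0.6},
    {"date": "2022-02-25", "over_18": False, "sentiment": -0.2},
    {"date": "2022-02-25", "over_18": True, "sentiment": -0.8},
    {"date": "2022-03-02", "over_18": True, "sentiment": 0.1},
    {"date": "2022-03-10", "over_18": False, "sentiment": 0.3},
]

def monthly_sentiment(rows):
    """Average sentiment per (YYYY-MM, over_18) group."""
    groups = defaultdict(list)
    for row in rows:
        month = row["date"][:7]  # "2022-02-24" -> "2022-02"
        groups[(month, row["over_18"])].append(row["sentiment"])
    return {key: mean(values) for key, values in sorted(groups.items())}

result = monthly_sentiment(records)
```

Splitting `result` by the `over_18` flag gives the two monthly series that the final plot would draw.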

Commodity + Reddit (Joining external data)

Q7. How can we relate the commodity market with the text & sentiment analysis of the Reddit dataset?

Business goal #1: Explore the commodity price changes before and after the start of the Russia-Ukraine conflict. This can be used as a reference to predict what will happen to commodity prices after the conflict.

Business goal #2: Which market factors could have driven commodity prices higher or lower?

Business goal #3: Explore the co-movement between different commodities.

Technical proposal: 1) Directly observe the price changes of different commodities since February 2022 from time-series charts. 2) Compare the price changes after February 2022 with the average price changes before that time, taking the standard deviation into account. 3) It is also useful to discuss which commodity had the largest price change, and whether that price movement matches the economic data.
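Step 2 can be made concrete by expressing each post-event price change as a z-score against the pre-event distribution of changes. The sketch below is one possible reading of that comparison; the price series and function names are hypothetical.

```python
from statistics import mean, stdev

def pct_changes(prices):
    """Simple period-over-period returns."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def event_z_scores(pre_prices, post_prices):
    """Z-score of each post-event return relative to pre-event returns."""
    pre_returns = pct_changes(pre_prices)
    mu, sigma = mean(pre_returns), stdev(pre_returns)
    # Prepend the last pre-event price so the first post-event return is defined.
    post_returns = pct_changes([pre_prices[-1]] + post_prices)
    return [(r - mu) / sigma for r in post_returns]

# Hypothetical prices before and after the event date.
pre = [100, 101, 100, 102, 101]
post = [110, 112]
z = event_z_scores(pre, post)
```

A large absolute z-score flags a move that is unusual relative to the commodity's own pre-event volatility, which makes moves comparable across commodities with different volatility levels.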

The goal is to examine economic data such as the exports and imports of Russia and Ukraine to determine whether commodity prices reflect market fundamentals. It would be interesting if some correlation can be found between historical exports/imports and commodity market prices. Understanding the fundamentals can help explain the changes in commodity prices more comprehensively.

Starting from the naïve Pearson correlation, this project will explore which commodities tend to move together. Since new events can introduce changes to commodity fundamentals, it is also important to compare the correlation before and after the conflict. To consider lag and lead effects, this project can conduct cross-correlation analysis as a next step.
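The Pearson and cross-correlation steps can be sketched in plain Python as below. The lag convention (a positive lag means the first series leads the second) and the sample series are assumptions chosen for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def cross_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag]; a positive lag means x leads y."""
    if lag > 0:
        return pearson(x[:-lag], y[lag:])
    if lag < 0:
        return pearson(x[-lag:], y[:lag])
    return pearson(x, y)

# Hypothetical series in which y repeats x one step later.
x = [1, 3, 2, 5, 4, 6]
y = [0] + x[:-1]
```

Here `cross_correlation(x, y, 1)` is 1 up to floating-point error because `y` is exactly `x` shifted by one step, while the same-day `pearson(x, y)` is lower. Running the function over a range of lags before and after the conflict start date gives the comparison described above.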

Commodity + Reddit (Joining external data)

Q8. How does each topic correlate to the commodity price changes?

Technical proposal: Conduct a correlation analysis across all the time series this project has constructed so far. Build a comprehensive view of the correlations between commodity price changes, sentiments, languages, and topics. Identify the most significant pairs. Further analysis can take lead-lag effects into account.

Commodity + Reddit (Joining external data)

Q9. Can we use Reddit factors to predict commodity price change?

Technical proposal: One of the main goals of this project is to experiment with different prediction models to test whether the next-day price change can be modeled using the psychological factors derived in the previous univariate analyses. Potential models include OLS, Lasso/Ridge Regression, tree-based methods, and neural networks. Identify which asset can be predicted most and least accurately.
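A minimal sketch of the next-day setup, assuming a single Reddit-derived feature: the feature on day t is paired with the return on day t + 1, and a one-feature ridge regression with an unpenalized intercept is fitted in closed form. The data, the function names, and the single-feature simplification are assumptions; the actual experiments would use many features and library implementations.

```python
def make_next_day_pairs(feature, returns):
    """Pair feature[t] with returns[t + 1] so the inputs contain no future information."""
    return list(zip(feature[:-1], returns[1:]))

def fit_ridge_1d(pairs, alpha):
    """One-feature ridge: minimizes sum (y - a - b*x)^2 + alpha * b^2. alpha = 0 gives OLS."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / (sxx + alpha)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical daily sentiment and next-day returns (returns[t + 1] = 0.2 * sentiment[t]).
sentiment = [0.1, -0.2, 0.3, 0.0, -0.1]
returns = [0.0, 0.02, -0.04, 0.06, 0.0]

pairs = make_next_day_pairs(sentiment, returns)
ols_slope, ols_intercept = fit_ridge_1d(pairs, alpha=0.0)
ridge_slope, ridge_intercept = fit_ridge_1d(pairs, alpha=0.1)
```

The penalty shrinks the ridge slope toward zero relative to OLS, which is the behavior that makes ridge useful when many noisy text features are used at once.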

Commodity + Reddit (Joining external data)

Q10. Which psychological factors had the most impact on the prediction results?

Technical proposal: After assessing the prediction results, this project can also dive deeper into the models to discuss which factors had the most impact on the predictions. Visualizations summarizing the experiments can be used. Using the analysis and intuition developed before, we can discuss the implications of the feature importance and their economic reasoning.

About the Team

Junlin(Joey) Liu

Minglei Cai

Xinlu Liu

Hyuksoo Shin