This year, financial markets have experienced additional turmoil due to the ongoing Russia-Ukraine conflict. Figures 1 and 2 show the effect of the conflict on the S&P 500 index and the wheat price. This project studies the event from a perspective that combines the commodity market with the online social media platform Reddit. Modern finance theory holds that asset prices are mainly affected by market and psychological factors. From the market perspective, global economies are presently facing severe challenges due to food and energy shortages, since Ukraine and Russia are among the major exporters of wheat, natural gas, and other commodities. From the behavioral finance perspective, geopolitical conflict can affect investors' psychology, which in turn can affect commodity prices. This project explores the relationship between the food and energy commodity markets and Russia-Ukraine related Reddit posts, in the hope that the data can capture both market and psychological factors. If another emergent event happens in the future, a similar approach can be used to evaluate its impact on commodity markets.
To analyze the Reddit data efficiently and quantitatively, we borrow techniques from the burgeoning field of Natural Language Processing (NLP), including Topic Modeling, Sentiment Analysis, and Language Detection. For the regression task of predicting returns, we experiment with a suite of machine learning (ML) algorithms, including Lasso/Ridge Regression, Principal Component Regression, Random Forest, Gradient Boosting, and Neural Networks.
As Reddit is a large community whose hundreds of thousands of users generate a large volume of information, we use the big data processing tool Spark to handle the heavy data lifting. Our ultimate goal, predicting commodity market returns using Reddit data, breaks down into three sections. The EDA section introduces the commodity and Reddit datasets with their summary statistics. The NLP section demonstrates our feature engineering pipeline and our two feature categories: Reddit Topics and Reddit Sentiment. The Machine Learning section explains the prediction algorithms and their results. Lastly, the Conclusion section summarizes the project and discusses the next steps.
Throughout our project, we will answer the data science questions below as milestones toward the final goal.
Business goal: Identify the topics related to the Russia-Ukraine conflict and use them to guide the development of ideas for the project.
Technical proposal: Use NLP to convert unstructured text data into a numerical format. The resulting matrix will be used to model topics from large collections of documents and to provide features for classifying documents. Use Latent Dirichlet Allocation (LDA) to model the topics, and use spaCy to remove numbers and punctuation and to lemmatize the result. Visualize the topic model using pyLDAvis.
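The first step of this proposal, turning raw text into a numerical matrix, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's Spark/spaCy pipeline: the example posts, `tokenize`, and `build_doc_term_matrix` are hypothetical names introduced here, and lemmatization is omitted. The resulting document-term count matrix is the kind of input an LDA model would consume.

```python
import re
from collections import Counter

# Hypothetical example posts; not taken from the project's dataset.
posts = [
    "Wheat exports from Ukraine are blocked, wheat prices jump 5%!",
    "Gas prices rise as pipelines close.",
    "Wheat and gas prices keep rising in 2022.",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (drops numbers and punctuation)."""
    return re.findall(r"[a-z]+", text.lower())


def build_doc_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix


vocab, dtm = build_doc_term_matrix(posts)
```

Each row of `dtm` is one post and each column one vocabulary term, so the matrix can be handed to a topic model or reused as classification features.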
Business goal: Reddit content can include languages other than English, especially when the topic relates to geopolitical issues like the Russia-Ukraine conflict. It will be useful to conduct text analysis in different languages.
Technical proposal: The first step is to identify which submissions/comments contain foreign words, using a language detector from John Snow Labs. Ukrainian and Russian are expected in the data. Following that, this project will identify keywords and phrases across the different languages. Finally, it will be useful to conduct topic modeling on each language and compare the differences.
Business goal: Within each topic, which submissions/comments are the most popular? We want to look into which articles/posts people discuss most and which therefore make the topic significant in the dataset.
Technical proposal: Based on the topic modeling from the previous step, we can convert the topic vectors in the new coordinate space back to the original document-word vectors. From the new matrix, we can see which document is the most significant in its corresponding topic. As a potential next step, we may also apply methods like word2vec/doc2vec to compute document vectors, project them to a lower-dimensional space (t-SNE/UMAP/PCA), and visualize them.
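One simple way to read off the most significant document per topic is to rank documents by their weight on each topic. The sketch below assumes a document-topic weight matrix is already available (for example, the per-document topic distribution from an LDA model). The ids, weights, and the helper `top_documents_per_topic` are hypothetical and only illustrate the ranking step.

```python
# Hypothetical document-topic weights (rows: documents, columns: topics),
# e.g. the per-document topic distribution produced by a fitted topic model.
doc_ids = ["post_a", "post_b", "post_c", "post_d"]
doc_topic = [
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.55, 0.05, 0.40],
    [0.05, 0.05, 0.90],
]


def top_documents_per_topic(ids, weights, k=1):
    """For each topic, return the k document ids with the largest topic weight."""
    n_topics = len(weights[0])
    result = {}
    for t in range(n_topics):
        # Rank document indices by their weight on topic t, largest first.
        ranked = sorted(range(len(ids)), key=lambda i: weights[i][t], reverse=True)
        result[t] = [ids[i] for i in ranked[:k]]
    return result


top_docs = top_documents_per_topic(doc_ids, doc_topic, k=2)
```

With real data, the same ranking could be combined with Reddit scores or comment counts to decide which posts count as "popular" within a topic.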
Business goal: Conduct sentiment analysis on the Reddit posts to identify the most significant sentiments across periods.
Technical proposal: Combine the content from submissions and comments in our dataset. Conduct sentiment analysis using Spark NLP. Categorize submission records by different emotions (surprise, fear, sadness…). Derive time series regarding the changes in sentiments. Conduct regression analysis with variables including author age group, language, etc.
Business goal: Predict whether a Ukrainian comment receives a positive or negative score using a text classification algorithm.
Technical proposal: While the bag-of-words approach is useful for text classification in English, we will explore whether the same method works for other languages. Our first step will be to create word tokens and then index each word to a non-ordinal number. We will then apply the bag-of-words features with logistic regression and naïve Bayes.
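The tokenize, index, and classify steps can be sketched in a few lines. The example below is a hedged illustration rather than the project's implementation: it uses whitespace tokenization, a hand-written multinomial naïve Bayes with Laplace smoothing (logistic regression is not shown), and a tiny set of hypothetical Ukrainian comments with made-up labels.

```python
import math

# Hypothetical labeled comments (1 = positive score, 0 = negative score).
train = [
    ("слава україні героям слава", 1),
    ("дякую за допомогу", 1),
    ("це брехня і пропаганда", 0),
    ("погана новина брехня", 0),
]


def build_vocab(texts):
    """Index each distinct token to an integer id (ids carry no ordering meaning)."""
    vocab = {}
    for text in texts:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Bag-of-words count vector; tokens outside the vocabulary are ignored."""
    counts = [0] * len(vocab)
    for tok in text.split():
        if tok in vocab:
            counts[vocab[tok]] += 1
    return counts


def train_nb(X, y, alpha=1.0):
    """Fit multinomial naive Bayes with Laplace smoothing parameter alpha."""
    n_features = len(X[0])
    log_prior, log_lik = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + alpha * n_features
        log_lik[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_lik


def predict_nb(x, model):
    """Return the class with the highest posterior log-score for count vector x."""
    log_prior, log_lik = model
    scores = {
        c: log_prior[c] + sum(n * w for n, w in zip(x, log_lik[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


texts, labels = zip(*train)
vocab = build_vocab(texts)
X = [vectorize(t, vocab) for t in texts]
model = train_nb(X, list(labels))
prediction = predict_nb(vectorize("слава героям", vocab), model)
```

Because the features are plain token counts, nothing in the sketch is specific to English. Whether whitespace tokenization is adequate for Ukrainian morphology is exactly the question the proposal sets out to test.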
Business goal: Is there any significant difference in sentiments/topics towards the RU event between teenagers and adults? We want to measure how the sentiment of teenagers and adults differs towards the smaller events within the whole RU conflict, mapped onto a time axis. We can then design different policies for these two groups.
Technical proposal: We first group the data by date/month and over_18. Then we run a sentiment analysis on each Reddit comment/submission and collect statistics of the sentiment scores in the different groups. Finally, we plot the two sentiment lines and map them to small events.
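The grouping and aggregation step can be sketched as below, assuming each record already carries a sentiment score. The records, the score range, and the helper `monthly_sentiment_by_group` are hypothetical. The `over_18` field is used only as the grouping key named in the proposal.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (date "YYYY-MM-DD", over_18 flag, sentiment score in [-1, 1]).
records = [
    ("2022-02-24", False, -0.6),
    ("2022-02-25", False, -0.4),
    ("2022-02-25", True, -0.2),
    ("2022-03-02", False, 0.1),
    ("2022-03-05", True, -0.5),
    ("2022-03-09", True, -0.1),
]


def monthly_sentiment_by_group(rows):
    """Average sentiment per (month, over_18) group, sorted by month then flag."""
    buckets = defaultdict(list)
    for date, over_18, score in rows:
        month = date[:7]  # "YYYY-MM"
        buckets[(month, over_18)].append(score)
    return {key: mean(vals) for key, vals in sorted(buckets.items())}


series = monthly_sentiment_by_group(records)
```

Splitting `series` by the flag yields the two lines to plot against the event timeline.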
Business goal #1: Explore the commodity price changes before and after the start of the Russia-Ukraine conflict. This can be used as a reference to predict what will happen to commodity prices after the conflict.
Business goal #2: What market factors could have driven commodity prices higher or lower?
Business goal #3: Explore the co-movement between different commodities.
Technical proposal: 1) Directly observe the price changes of different commodities since February 2022 from time-series charts. 2) Compare the price changes after February 2022 with the average price changes before that time, taking the standard deviation into account. 3) It is also useful to discuss which commodity had the largest price change, and whether that price movement matches the economic data.
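Step 2 can be made concrete by expressing each post-event price change as a z-score relative to the pre-event mean and standard deviation. The sketch below is one possible reading of that comparison: the price series, `event_index`, and both helper functions are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical daily closing prices; index 6 is the first post-event observation.
prices = [100.0, 101.0, 100.0, 101.0, 100.0, 101.0, 106.0, 110.0]
event_index = 6


def pct_changes(series):
    """Simple day-over-day percentage changes."""
    return [(b - a) / a for a, b in zip(series, series[1:])]


def post_event_z_scores(series, event_idx):
    """Z-score of each post-event change against the pre-event mean and std."""
    changes = pct_changes(series)
    # changes[i] is the change ending on day i + 1, so changes ending
    # before event_idx belong to the pre-event window.
    pre = changes[: event_idx - 1]
    post = changes[event_idx - 1:]
    mu, sigma = mean(pre), stdev(pre)
    return [(c - mu) / sigma for c in post]


z_scores = post_event_z_scores(prices, event_index)
```

A post-event change several standard deviations away from the pre-event average is a candidate for the "largest price change" discussion in step 3.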
The goal is to examine economic data such as export/import from Russia and Ukraine to identify if the commodity prices are reflecting market fundamentals. It would be interesting if some correlation can be found between historical export/import and commodity market prices. Having an understanding of the fundamentals can help deliver a comprehensive reasoning behind the change in commodity prices.
Starting from the naïve Pearson correlation, this project will explore which commodities tend to move together. Since new events can introduce changes to commodity fundamentals, it is also important to compare the correlation before and after the conflict. To consider lag and lead effects, this project can conduct cross-correlation analysis as a next step.
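Both the plain Pearson correlation and its lagged variant can be sketched with the standard library alone. In the hypothetical data below, the gas series is constructed to repeat the wheat series one day later, so the lag search should recover a lag of one day. The series and function names are illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def cross_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag]; lag > 0 means x leads y."""
    if lag > 0:
        return pearson(x[:-lag], y[lag:])
    if lag < 0:
        return pearson(x[-lag:], y[:lag])
    return pearson(x, y)


# Hypothetical daily returns: gas repeats wheat with a one-day delay.
wheat = [0.01, -0.02, 0.03, 0.00, -0.01, 0.02, -0.03, 0.01]
gas = [0.00, 0.01, -0.02, 0.03, 0.00, -0.01, 0.02, -0.03]

best_lag = max(range(-2, 3), key=lambda k: cross_correlation(wheat, gas, k))
```

Applying `pearson` separately to the pre-conflict and post-conflict sub-windows gives the before/after comparison described above.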
Technical proposal: Conduct a correlation analysis across all the time series this project has constructed so far, to obtain a comprehensive view of the correlations between commodity price changes, sentiments, languages, and topics. Identify the most significant pairs. Further analysis can take lead-lag effects into account.
Technical proposal: One of the main goals of this project is to experiment with different predictive models to test whether the next-day price change can be modeled using the psychological factors derived from the previous univariate analyses. Potential models include OLS, Lasso/Ridge Regression, tree-based methods, and neural networks. Identify which asset can be predicted most and least accurately.
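The key detail in next-day prediction is the alignment: a feature observed on day t must be paired with the return on day t + 1, and the train/test split must respect time order. The sketch below shows this with a single-feature ridge regression in closed form. The data, the 70/30 split, and the penalty value are hypothetical choices, and the project's actual models use more features.

```python
from statistics import mean

# Hypothetical daily series: a sentiment feature and same-day commodity returns.
sentiment = [0.2, -0.1, 0.4, -0.3, 0.1, 0.5, -0.2, 0.3, -0.4, 0.0]
returns = [0.000, 0.011, -0.004, 0.019, -0.016, 0.006, 0.024, -0.009, 0.016, -0.021]


def make_next_day_pairs(feature, target):
    """Pair the feature on day t with the target on day t + 1."""
    return list(zip(feature[:-1], target[1:]))


def fit_ridge_1d(pairs, lam=0.0):
    """Closed-form single-feature ridge on centered data; lam=0 reduces to OLS."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / (sxx + lam)
    intercept = my - slope * mx
    return slope, intercept


pairs = make_next_day_pairs(sentiment, returns)
split = int(len(pairs) * 0.7)  # time-ordered split: earlier days train, later days test
train_pairs, test_pairs = pairs[:split], pairs[split:]

slope, intercept = fit_ridge_1d(train_pairs, lam=0.01)
test_mse = mean([(y - (intercept + slope * x)) ** 2 for x, y in test_pairs])
```

Comparing `test_mse` across assets, with the same split, is one way to rank which asset is predicted most or least accurately.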
Technical proposal: After assessing the prediction results, this project can also dive deeper into the models to discuss which factors had the most impact on the predictions. Visualizations summarizing the experiments can be used. Using the analysis and intuition developed earlier, we can discuss the implications of the feature importances and the economic reasoning behind them.