How to build an Emotion-Based Song Retrieval Model in Python
Authors: Chengyue Qiu, Yue Ma
Introduction
There are many song retrieval systems in our lives. When you feel happy, you might want to listen to a song, so you open Spotify or Pandora and type in “Willow”. Sometimes, though, you don’t have a specific song in mind. A random thought pops up, like “I feel so bored at work”, and you want background music to go with it. This is the problem we want to solve: a ranked music retrieval system that recommends songs based on a free-form emotional text input.
Data
To train a model that can recommend songs based on user input, we mainly need two kinds of data: queries with emotion labels and songs with lyrics.
The queries come from the EmotionLines dataset, which includes two dialogue datasets: Friends and EmotionPush. The Friends dataset is speech-based, built from annotated dialogues from the Friends TV show. The EmotionPush dataset is chat-based and uses real Facebook Messenger chats.
Each line object includes an utterance and an emotion label, one of “neutral, joy, sadness, fear, anger, surprise, disgust”. We only use sentences from Friends because the Facebook Messenger chats include too many codenames. We also only selected sentences with more than 5 words, so that each query carries scenario information as well as emotional information.
The songs and their emotion labels come from the IMAC database, which contains almost 4,000 songs. For the lyrics, we used the Genius API to fetch lyrics for further sentiment analysis.
# https://github.com/johnwmillr/LyricsGenius
import lyricsgenius as lg
genius = lg.Genius('Client access token',
                   skip_non_songs=True,
                   excluded_terms=["(Remix)", "(Live)"],
                   remove_section_headers=True)
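With the client set up, lyrics can be pulled for each track. The snippet below is a minimal sketch of that step; the title and artist column names are assumptions about how the song table is organized, and search_song returns None when Genius has no match.
# hypothetical column names, for illustration only
def fetch_lyrics(title, artist):
    song = genius.search_song(title, artist)
    return song.lyrics if song else None

song_df['lyrics'] = song_df.apply(
    lambda row: fetch_lyrics(row['title'], row['artist']), axis=1)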
We also annotated a relevance score for each song with respect to a sentence from Friends. We chose 20 sentences to annotate, and each sentence has at least 50 songs with annotated relevance judgments, on a scale from -2 (not relevant at all) to 2 (highly relevant).
Method
Since we’re going to build an emotion-based music retrieval system, the input is the user’s text (a sentence and/or emoji) and the output is a ranked list of relevant songs. To achieve this, we use sentiment analysis to predict the emotion of the given text and then apply information retrieval algorithms to retrieve relevant songs by matching the sentence against lyrics.
Data pre-processing
We used different pre-processing pipelines for the sentiment analysis and the music retrieval steps.
In the sentiment analysis process, we replaced emojis with their textual meaning (so users can input emoji to retrieve songs), converted the text to lower case, and encoded the data by emotion tag (collapsing the 7 emotions into positive, negative, and neutral). We decided not to remove punctuation because it can be important for determining emotion.
# Encoding the data by emotion tag
sent_to_id = {"anger":0,'sadness':0,'disgust':0,'fear':0,'surprise':1, "joy":1,"neutral":2}
df["sentiment_id"] = df['emotion'].map(sent_to_id)
import re
import emoji

# function to lower-case text, replace URLs, and transform emoji into text
def cleanText(text):
    text = re.sub(r'\|\|\|', r' ', text)
    text = re.sub(r'http\S+', r'<URL>', text)
    text = text.lower()
    text = emoji.demojize(text)
    return text
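Applying the cleaner to the Friends utterances then produces the cleaned_sentence column used by the LSTM code later on; df2 and the utterance column name below are assumptions about how the frames are organized.
# assumed frame and column names, for illustration
df2 = df.copy()
df2['cleaned_sentence'] = df2['utterance'].map(cleanText)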
In the music retrieval process, we removed brackets, line breaks, non-English lyrics, stopwords, and URLs, and lemmatized the lyrics.
# remove round brackets and curly brackets but not the text within
song_df['lyrics'] = song_df['lyrics'].map(lambda s: re.sub(r'\(|\)', '', s))
song_df['lyrics'] = song_df['lyrics'].map(lambda s: re.sub(r'\{|\}', '', s))

# remove line breaks
song_df['lyrics'] = song_df['lyrics'].map(lambda s: re.sub(r' \n|\n', '', s))

# remove non-English songs
from langdetect import detect_langs

def get_eng_prob(text):
    detections = detect_langs(text)
    for detection in detections:
        if detection.lang == 'en':
            return detection.prob
    return 0

song_df['en_prob'] = song_df['lyrics'].map(get_eng_prob)
song_df = song_df.loc[song_df['en_prob'] >= 0.5]

# lower-case and remove URLs and emoji
def lower_url(file):
    file_lowered = file.lower()
    file_url = re.sub(r'^https?:\/\/.*[\r\n]*', '', file_lowered, flags=re.MULTILINE)  # remove URLs
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # enclosed characters
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', file_url)

song_df['lyrics'] = song_df['lyrics'].map(lower_url)
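Stopword removal and lemmatization, mentioned above, are not shown in the snippet; a minimal sketch assuming NLTK (the repository may use a different library) looks like this.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def remove_stopwords_and_lemmatize(text):
    # drop stopwords, then reduce each remaining token to its lemma
    tokens = [t for t in text.split() if t not in stop_words]
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)

song_df['lyrics'] = song_df['lyrics'].map(remove_stopwords_and_lemmatize)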
More details on how we cleaned the lyrics can be found on GitHub.
Sentiment analysis
In this step, we build a model to predict the emotion of the user’s input so that we can filter songs by sentiment before applying document ranking techniques.
We tried a basic LSTM, an LSTM with GloVe word embeddings, and an LSTM with word2vec embeddings, because an LSTM handles long sequences better than a vanilla RNN.
Please see the blog post to learn more about LSTMs and word embeddings.
First, we tried a basic LSTM model. We set the maximum number of features to 2,000 and used Keras’s Tokenizer to vectorize the text and convert it into padded sequences the network can take as input.
# tokenize and convert to sequences
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_features = 2000
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(df2['cleaned_sentence'].values)
X = tokenizer.texts_to_sequences(df2['cleaned_sentence'].values)
X = pad_sequences(X)
We used softmax as the output activation together with a categorical cross-entropy loss, since this is a multi-class classification problem. We trained the model for 7 epochs, held out a validation set, and measured its score and accuracy.
# initiate the LSTM model
from keras.models import Sequential
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

embed_dim = 200
lstm_out = 250

model = Sequential()
model.add(Embedding(max_features, embed_dim, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

# train the LSTM model
batch_size = 32
model.fit(X_train, Y_train, epochs=7, batch_size=batch_size, verbose=2)
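The X_train and Y_train above come from a held-out split that the snippet doesn’t show. Here is a minimal sketch of that split and of the score/accuracy measurement; pd.get_dummies for one-hot labels and an 80/20 split are assumptions, not necessarily our exact setup.
import pandas as pd
from sklearn.model_selection import train_test_split

# one-hot encode the three sentiment classes (done before model.fit above)
Y = pd.get_dummies(df2['sentiment_id']).values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# after training, measure score (loss) and accuracy on the held-out set
score, acc = model.evaluate(X_test, Y_test, batch_size=batch_size, verbose=2)
print("score: %.2f" % score)
print("acc: %.2f" % acc)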
Because the accuracy was not very high, which would introduce more bias in the next step, we tried two different word embeddings.
The first is the GloVe 6B 200d word embedding, which contains English word vectors pre-trained on the combined Wikipedia 2014 + Gigaword 5 corpora (6B tokens, 400K vocabulary).
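A minimal sketch of loading the GloVe vectors into an embedding matrix keyed by the tokenizer’s word index (the local file path is an assumption):
import numpy as np

embed_dim = 200
embeddings_index = {}
with open('glove.6B.200d.txt', encoding='utf-8') as f:  # assumed local path
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# rows follow the Keras tokenizer's word index; unknown words stay all-zero
glove_matrix = np.zeros((len(tokenizer.word_index) + 1, embed_dim))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        glove_matrix[i] = vector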
The second is a set of embeddings we trained ourselves on our sentence corpus with the gensim package (using its Doc2Vec implementation).
# build a gensim model for embeddings (gensim < 4.0 API)
import numpy as np
from tqdm import tqdm
from gensim.models.doc2vec import Doc2Vec

d2v_model = Doc2Vec(dm=1, dm_mean=1, size=20, window=8, min_count=1,
                    workers=1, alpha=0.065, min_alpha=0.065)
d2v_model.build_vocab([x for x in tqdm(train_tagged.values)])

# save the vectors in a new matrix for model training
embedding_matrix = np.zeros((len(d2v_model.wv.vocab) + 1, 20))
for i, vec in enumerate(d2v_model.docvecs.vectors_docs):
    if i < embedding_matrix.shape[0]:
        embedding_matrix[i] = vec
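Either matrix (the GloVe one or the gensim one) can then be plugged into the Embedding layer instead of learning embeddings from scratch. Here is a sketch of the GloVe variant; freezing the pre-trained vectors with trainable=False is one common choice, not necessarily the exact configuration behind our final numbers.
from keras.models import Sequential
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=glove_matrix.shape[0],
                    output_dim=glove_matrix.shape[1],
                    weights=[glove_matrix],     # pre-trained vectors
                    input_length=X.shape[1],
                    trainable=False))           # keep the embeddings frozen
model.add(SpatialDropout1D(0.4))
model.add(LSTM(250, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])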
Document ranking techniques
We used two main ranking techniques. The first is the BM25 family: BM25, BM25+, and BM25L. These methods score the similarity between a query and a document using term frequency; put simply, the more of your query’s words appear in a song’s lyrics, the higher that song is ranked.
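A minimal sketch of BM25 ranking with the rank_bm25 package (one common implementation; BM25Plus and BM25L can be swapped in the same way, and the example query is just an illustration):
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = song_df['lyrics'].tolist()
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)  # or BM25Plus / BM25L

tokenized_query = "i feel so bored at work".split()
scores = bm25.get_scores(tokenized_query)              # one score per song
top10 = bm25.get_top_n(tokenized_query, corpus, n=10)  # top-10 lyrics by BM25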
The other method is semantic analysis; here we mainly use Latent Semantic Analysis (LSA). Instead of matching on term frequency alone, LSA maps queries and lyrics into a low-dimensional topic space, so it cares more about the semantic meaning behind a sentence.
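A minimal LSA sketch with gensim’s LSI model (the number of topics is an arbitrary choice for illustration):
from gensim import corpora, models, similarities

texts = [doc.split() for doc in song_df['lyrics']]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# TF-IDF weighting followed by LSI (i.e. LSA) with, say, 200 latent topics
tfidf = models.TfidfModel(bow_corpus)
lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=200)
index = similarities.MatrixSimilarity(lsi[tfidf[bow_corpus]])

query_bow = dictionary.doc2bow("i feel so bored at work".split())
sims = index[lsi[tfidf[query_bow]]]                          # similarity to every song
ranked = sorted(enumerate(sims), key=lambda t: -t[1])[:10]   # top-10 song indices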
Evaluation and Baseline
Metrics
In our research, we use two metrics to evaluate the final result: Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain at 10 (NDCG@10).
Setting up these metrics takes some care. Since our ground truth doesn’t cover all songs, our model may retrieve songs that are relevant to a query but have no rating to compare against. To solve this, for each test query we only evaluate the songs we annotated for that query in our ground truth table.
For example, suppose query one retrieves songs with ratings 1, 2, 2, nan, nan, nan, 2. We do not count the nan values; instead, we take the first 10 consecutive retrieved songs that have annotated ratings and compute NDCG@10 and MAP over them.
How to calculate MAP:
We treat a song with a score larger than 1 as a relevant song (a true positive). For each relevant song in the ranked list, we compute the precision at that rank (true positives / results returned so far). Here is the formula for AP.
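Written out, for a query with R relevant songs among n annotated results, the standard formulas are:
\mathrm{AP} = \frac{1}{R} \sum_{k=1}^{n} P(k)\,\mathrm{rel}(k), \qquad \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q)
where P(k) is the precision of the top k results and rel(k) is 1 if the song at rank k is relevant (score larger than 1) and 0 otherwise.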
How to calculate NDCG:
For the ranked songs, we first compute DCG by summing relevance score / log2(rank + 1) over the items, then compute IDCG (the ideal DCG, i.e. the DCG of the perfectly ordered list), and take NDCG = DCG / IDCG. In industry, people often prefer the exponential-gain form of DCG below, which puts much more weight on highly relevant items.
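Concretely, the basic form and the exponential-gain form of DCG at cutoff p are:
\mathrm{DCG@p} = \sum_{i=1}^{p} \frac{\mathrm{rel}_i}{\log_2(i+1)} \qquad \text{or} \qquad \mathrm{DCG@p} = \sum_{i=1}^{p} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}, \qquad \mathrm{NDCG@p} = \frac{\mathrm{DCG@p}}{\mathrm{IDCG@p}}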
The reason we use two metrics is that NDCG mainly measures whether the ranking is well ordered. Since we have graded relevance scores, a query whose results are correctly ordered but contain no rating-2 songs can still receive a high NDCG. To compensate for this drawback, we also report MAP, which tells us whether the most relevant songs were actually retrieved.
Here is an example of two query results that illustrates what we mean:
Q1: 1 2 2 -1 -1
Q2: 1 1 1 0 0
MAP(Q2) < MAP(Q1) and NDCG(Q2) > NDCG(Q1).
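A quick sketch that reproduces this comparison, using the linear-gain DCG described above and treating a query with no score-2 songs as having AP = 0:
import numpy as np

def average_precision(rels, threshold=1):
    # a song counts as relevant when its score is larger than the threshold
    hits, precisions = 0, []
    for k, rel in enumerate(rels, start=1):
        if rel > threshold:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

def ndcg(rels):
    dcg = sum(rel / np.log2(i + 1) for i, rel in enumerate(rels, start=1))
    idcg = sum(rel / np.log2(i + 1) for i, rel in enumerate(sorted(rels, reverse=True), start=1))
    return dcg / idcg

q1 = [1, 2, 2, -1, -1]
q2 = [1, 1, 1, 0, 0]
print(average_precision(q1), average_precision(q2))  # ~0.58 vs 0.0 -> MAP(Q2) < MAP(Q1)
print(ndcg(q1), ndcg(q2))                            # ~0.83 vs 1.0 -> NDCG(Q2) > NDCG(Q1)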
Baseline
To create our baseline, we used plain BM25 to rank all songs in our dataset and retrieved the top 10 songs with the highest BM25 scores. For the baseline, we simply split the query into a bag of words and computed the BM25 score without any other preprocessing, and then used those songs to calculate the final metrics.
Here is the result of our baseline. The NDCG values for some queries are extremely high, but MAP is 0 for every query; the final MAP and NDCG@10 for the baseline are 0 and 0.969. The baseline gets a very high NDCG@10 score but an extremely low MAP score because it doesn’t retrieve any relevant songs at all. For some queries, every retrieved song is rated -2, and since that list is already in its “ideal” order, NDCG treats it as a perfect ranking and scores it 1, even though the result is actually very bad. That is why we also include MAP, to see how many true positives we get.
So our goal is to improve MAP over this baseline.
Results and Conclusion
As for sentiment analysis, we used accuracy and loss as evaluation metrics. We can see that accuracy increases and loss decreases as the number of epochs grows.
Because the LSTM model with word2vec embeddings had the best performance, we used it as our final model to predict the emotion of user inputs.
As for document ranking, we compare the BM25 and semantic analysis results below.
Interpretation of results
For our model, we greatly improved the MAP value, but NDCG did not increase much. In terms of performance, semantic analysis works much better than BM25. The reason is that our system is emotion-based song retrieval: rather than just using term frequency to measure similarity between the query and the documents, it is better to capture the semantics or topics behind them, so the retrieved songs match the query input more closely.
Our approach did well overall, and its performance can satisfy an end user searching by emotion. First, our final system classifies the query’s emotion and filters out songs with a different emotion. Then it runs a semantic analysis over the user query and the song lyrics to produce a ranked list of songs. Filtering first greatly shrinks the range of songs to retrieve from, which improves the final result.
Although our model did not improve the NDCG values much, it improved MAP by 13%. That is a substantial gain: it means our models can retrieve the songs most relevant to an emotion and also rank them in a good order. Since the baseline’s NDCG is artificially high, it makes sense that we did not beat it on that metric. The remaining low MAP comes from our assumption that only songs scored 2 are relevant; we did not have enough songs labeled 2, so some queries retrieve no relevant songs at all, which drags MAP down sharply. To improve this, it would be better to also count songs with a score of 1 as relevant.
For our final model, in the first sentiment analysis step we chose the LSTM with word2vec embeddings, and in the second document ranking step we chose the LSI model to retrieve songs.
Interactive Interface
You can run Prediction_Model.py to interact with this tool.
Future Work
If we had more time, we would work on the following steps:
1. Better annotation
In our current research, each team member annotated 50 songs for 10 different queries, for a total of 20 queries and 1,000 songs. Some bias may exist in this annotation process. With more time to annotate, we would include other experts to create a richer dataset.
2. Use Flask to build a front-end interface
Currently, we did not have enough time to build a front-end interface. If possible, we would like to use Flask and HTML/CSS to build a web-based interface, so users can get songs and links to Spotify after they input some sentences.
3. Try more models to compare performance
For the sentiment analysis part, we would like to try the RoBERTa base model and further tune its hyperparameters to increase accuracy. As for document ranking techniques, we would like to explore latent semantic analysis further.
4. Add an image input feature
We would like to add another option to upload images and match relevant songs in the future.
Github link: https://github.com/amorqiu/Emotional-song-Retrival