Introduction & Overview
As a society, we too often think of birth control as just a women's issue. The consequences of a pregnancy affect women more directly, so the responsibility for birth control falls largely on their shoulders. In online communities like Reddit (specifically r/birthcontrol), it's obvious that women dominate the conversation. Male-authored posts are few and far between, but men do engage in the discussion. In a female-dominated space like this, I'm curious whether men contribute in a markedly different manner than women, and whether it would be possible to predict the gender of the poster based on the language used in the post.
I hypothesize that the posts from female authors and male authors will be significantly different and that it will be possible to successfully predict gender using word embeddings and Logistic Regression.
The data for this project will be scraped from posts in the r/birthcontrol subreddit. Reddit accounts, which are notoriously easy to make, provide a thick cloak of anonymity. Names of posters are usually randomly generated, providing no hints to the gender or identity of the poster. Luckily, the context here lends itself to some other tools for classifying gender. For part 2 of this project, I will use spaCy's natural language processing to read part-of-speech tags and count the pronouns used in each post. I will also count words used by the poster like "boyfriend" or "girlfriend" (I call these "relationship words" throughout). From these counts I will assign a "gold label" gender. In part 3 of the project, I will then remove all pronouns and relationship words from the text of each post and return a word embedding vector that is the average of the vectors of the non-stopword tokens in the input text. Finally, in part 4 of the project, I will use a matrix of these vectors as the input to a Logistic Regression classifier, and improve the classifier by optimizing its parameters using GridSearchCV. Part 5 is the “bonus” section of the project, and will contain some simple experiments like sentiment scoring and visualizations.
Data & Methods
Part 1: Scraping Reddit Data from r/birthcontrol
Relying heavily on lecture code from Maria Antoniak's research, I scraped the data from 16K r/birthcontrol posts spanning the years 2018 through 2022. I then removed duplicates, removed posts with the word "bot," and removed posts with fewer than 5 different words, resulting in a data frame to work with. I took data only from original posts, not comments in reply threads. I reviewed the columns of data I scraped and removed those that were uninteresting to me, including url, created_utc, and parent_id. The dataset is limited in that I was only able to find 503 male-authored posts (roughly 3% of all the posts). I therefore sampled the same number of posts from the female-authored posts. As a result, I was limited to analysis of only 1006 posts in total, a smaller corpus than I had intended to work with. This limitation arose because I expected male-authored posts to occur more frequently than they actually did.
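The balancing step described above can be sketched as follows. This is a minimal illustration on a toy data frame, not the notebook's own sampling code: the `labeled` frame and its contents are hypothetical, while the `Poster_Gender` column name and the 1 = male / 0 = female coding follow the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the labeled posts.
# Coding follows the notebook: 1 = male-authored, 0 = female-authored.
labeled = pd.DataFrame({
    "text": [f"post {i}" for i in range(20)],
    "Poster_Gender": [1] * 3 + [0] * 17,
})

# Size of the minority (male-authored) class.
n_minority = int((labeled["Poster_Gender"] == 1).sum())

# Draw the same number of rows from each class; random_state makes the draw repeatable.
balanced = labeled.groupby("Poster_Gender").sample(n=n_minority, random_state=0)

print(balanced["Poster_Gender"].value_counts().to_dict())
```

Downsampling the majority class this way keeps the two classes the same size, at the cost of discarding most female-authored posts.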
Part 2: Assigning Gold Label Gender of Poster
This project makes major assumptions of cis-gender, heteronormative couples. Assumptions like these should not be made lightly, and point to the rudimentary nature of this project. Based on the majority of posts, I assume couples that are in need of birth control are composed of a male and a female partner.
With this assumption, I relied upon relationship words used by the poster like "boyfriend" or "girlfriend" to determine the poster gender. For example, a poster that writes about what her boyfriend did can be assumed to be a woman (again, a problematic assumption, but in the context of birth control and this exploratory project, it's what I'm working with). Additionally, because discussions of birth control are by and large centered around the female body, I rely upon the assumption that male posters use more third person pronouns when talking about their partner (like "she" and "her"), while female posters use more first person pronouns like "I" and "my" as well as male third person pronouns like "he."
To create my gold labels for gender, I counted the number of these relationship words and each category of pronouns. I then reviewed the first 100 posts to get a sense for how many of each could be sufficient to create a gold label. I determined that using a relationship word even just once (while also using none of the opposite relationship words) was sufficient to label the gender, and if there were no relationship words present, then the label could be created based on which pronoun category had more in it. I felt comfortable choosing which one had more, because most of the posts had many of one category and very few or none in the other. If a post had no relationship words and an equal number of pronouns, then I did not classify it as a gender and eventually removed it from the dataset. After establishing these rules and assigning the labels, I also spot checked these no-gender posts, to ensure that there was no indication of gender that I had missed. I understand that these labels cannot always be accurate, but through rigorous targeted spot-checking, I believe they are an acceptable standard.
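The labeling rules above can be written as one small function. This is a sketch for illustration: the name `gold_label` and its argument names are hypothetical, but the branch order mirrors the rules described here (relationship words first, then the larger pronoun category, otherwise no label).

```python
from typing import Optional

def gold_label(words_f: int, words_m: int, pronouns_f: int, pronouns_m: int) -> Optional[int]:
    """Return 1 (male), 0 (female), or None when the counts give no signal."""
    # Relationship words decide the label when only one side is present.
    if words_m > 0 and words_f == 0:
        return 1
    if words_f > 0 and words_m == 0:
        return 0
    # Otherwise fall back to whichever pronoun category is larger.
    if pronouns_m > pronouns_f:
        return 1
    if pronouns_f > pronouns_m:
        return 0
    # Ties (including posts with no pronouns at all) stay unlabeled.
    return None

print(gold_label(words_f=0, words_m=1, pronouns_f=5, pronouns_m=0))  # 1
print(gold_label(words_f=0, words_m=0, pronouns_f=2, pronouns_m=2))  # None
```

Note that a post using both kinds of relationship words falls through to the pronoun comparison rather than being labeled by the words.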
Part 3: Remove Pronouns and Relationship Words from the Texts & Create a Matrix of Word Embeddings
Determining gender using pronouns and relationship words is overall uninteresting. There’s not much that can be said about men using the word “girlfriend” and women using the word “boyfriend.” What is interesting, however, is whether men and women write posts differently aside from these relationship words and pronouns. This is the purpose of the project. In part 3, I used spaCy natural language processing to identify parts of speech, then removed pronouns and the identified relationship words from the text of each post. This way, I would be able to determine whether men and women write Reddit posts differently aside from the functional differences in pronouns and obvious gendered words. From the remaining tokens for each text, I returned a word embedding vector that is the average of the vectors of the non-stopword tokens in the input post text. From these embedding vectors, I created a matrix to be used in logistic regression.
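The averaging step can be illustrated without loading a language model. The sketch below uses hypothetical 3-dimensional toy vectors and tiny illustrative word lists (`TOY_VECTORS`, `STOPWORDS`, and `EXCLUDED` are all made up for this example); the notebook itself takes vectors from spaCy's `en_core_web_lg` model.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy vectors; real spaCy vectors are much longer numpy arrays.
TOY_VECTORS: Dict[str, List[float]] = {
    "pill": [1.0, 0.0, 2.0],
    "period": [3.0, 2.0, 0.0],
    "late": [2.0, 4.0, 1.0],
}
STOPWORDS: Set[str] = {"the", "is", "a", "and"}  # tiny illustrative list
EXCLUDED: Set[str] = {"i", "my", "he", "she", "her",
                      "boyfriend", "girlfriend", "bf", "gf"}  # pronouns + relationship words

def mean_vector(tokens: List[str]) -> Optional[List[float]]:
    """Average the vectors of tokens that survive the stopword and exclusion filters."""
    kept = [
        TOY_VECTORS[t]
        for t in (tok.lower() for tok in tokens)
        if t in TOY_VECTORS and t not in STOPWORDS and t not in EXCLUDED
    ]
    if not kept:
        return None  # nothing left to average
    # zip(*kept) walks the vectors dimension by dimension.
    return [sum(dim) / len(kept) for dim in zip(*kept)]

print(mean_vector("my girlfriend is late and the pill".split()))  # [1.5, 2.0, 1.5]
```

Only "late" and "pill" survive the filters here, so the result is the element-wise mean of their two vectors.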
Part 4: The Big Moment!! Logistic Regression!
I used the embedded_vector matrix, built from the text without the pronouns or relationship words, to predict the gender of the poster. I used the sklearn Logistic Regression model with default parameters and calculated a cross-validated score. I then used SelectKBest to keep the top 50 features. Finally, I found the optimal parameters using GridSearchCV and ran the model with these ideal parameters. I then calculated the cross-validated score for this model with optimal parameters.
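The modeling steps above might look roughly like the sketch below. It runs on synthetic data rather than the real embedding matrix, and several choices are assumptions made only for illustration: the 300 feature columns (the vector width of `en_core_web_lg`), `max_iter=1000` (the notebook reports using default parameters for its first model), the `f_classif` scoring function, and the grid of `C` values.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the embedding matrix (X) and the 0/1 gold labels (y).
X, y = make_classification(n_samples=200, n_features=300,
                           n_informative=20, random_state=0)

# Baseline: plain logistic regression scored with 5-fold cross-validation.
baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

# Feature selection plus classifier in one pipeline, so SelectKBest is
# refit on the training folds only during cross-validation.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=50)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the regularization strength C.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(f"baseline accuracy: {baseline:.3f}")
print(f"best C: {search.best_params_['clf__C']}, tuned accuracy: {search.best_score_:.3f}")
```

Keeping the selector inside the pipeline is one way to avoid choosing features with knowledge of the held-out folds; whether the notebook does this is not stated here.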
Part 5: Bonus / Fun Little Explorations
Bonus: part 5 included small, simple experiments that explored the context of my hypothesis. I created two word clouds of the words used most frequently in either male-authored or female-authored posts. I also ran a t-test to determine if the differences in the word use frequency lists were statistically significant comparing male and female posts. Finally, I sentiment scored the posts and compared the average scores between male and female authored posts.
Results
Part 1: Scraping Reddit Data from r/birthcontrol
Modified from Maria Antoniak's Code from lecture
First, installing necessary packages
import sys
!{sys.executable} -m pip install psaw little_mallet_wrapper Levenshtein
!{sys.executable} -m pip install tomotopy
#importing necessary packages
from datetime import datetime
import os
import glob
import pandas as pd
from psaw import PushshiftAPI
base_path = os.path.join('reddit_data') # creating a directory for the data
if not os.path.exists(base_path): # if it does not exist
    os.makedirs(base_path) # create it
Scraping reddit posts:
""" Maria Antoniak's code with minor modifications """
def scrape_posts_from_subreddit(subreddit, api, year, month, end_date):
    '''
    Takes the name of a subreddit, the PushshiftApi, a year and month to scrape from,
    and the day of the month on which the scraping window ends
    '''
    start_epoch = int(datetime(year, month, 1).timestamp()) # convert date into Unix timestamp
    end_epoch = int(datetime(year, month, end_date).timestamp())
    gen = api.search_submissions(after=start_epoch,
                                 before=end_epoch,
                                 subreddit=subreddit,
                                 filter=['url', 'author', 'created_utc', # info we want about the post
                                         'title', 'subreddit', 'selftext',
                                         'num_comments', 'score', 'link_flair_text', 'id'])
    max_response_cache = 100000
    scraped_posts = []
    for _post in gen:
        scraped_posts.append(_post)
        if len(scraped_posts) >= max_response_cache: # avoid requesting more posts than allowed
            break
    scraped_posts_df = pd.DataFrame([p.d_ for p in scraped_posts])
    return scraped_posts_df
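The scraping window is defined by two Unix timestamps. The sketch below shows that conversion in isolation; `month_window` is a hypothetical helper, and it pins the timezone to UTC as an assumption so the numbers are reproducible (the function above uses naive datetimes, which follow the local timezone of the machine).

```python
from datetime import datetime, timezone

def month_window(year: int, month: int, end_day: int):
    """Return (start, end) Unix timestamps in UTC for one scraping window."""
    start = int(datetime(year, month, 1, tzinfo=timezone.utc).timestamp())
    end = int(datetime(year, month, end_day, tzinfo=timezone.utc).timestamp())
    return start, end

start, end = month_window(2018, 3, 31)
print(start, end)  # 1519862400 1522454400
```

The end bound is midnight at the start of `end_day`, so anything posted later on that final day falls outside the window.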
Scraping reddit comments:
""" Maria Antoniak's code with minor modifications """
def scrape_comments_from_subreddit(subreddit, api, year, month, end_date):
    '''
    Takes the name of a subreddit, the PushshiftApi, a year and month to scrape from,
    and the day of the month on which the scraping window ends
    '''
    start_epoch = int(datetime(year, month, 1).timestamp()) # convert date into Unix timestamp
    end_epoch = int(datetime(year, month, end_date).timestamp())
    gen = api.search_comments(after=start_epoch,
                              before=end_epoch,
                              subreddit=subreddit,
                              filter=['author', 'body', 'created_utc', # info we want about the comment
                                      'id', 'link_id', 'parent_id',
                                      'reply_delay', 'score', 'subreddit'])
    max_response_cache = 100000
    scraped_comments = []
    for _comment in gen:
        scraped_comments.append(_comment)
        if len(scraped_comments) >= max_response_cache: # avoid requesting more comments than allowed
            break
    scraped_comments_df = pd.DataFrame([p.d_ for p in scraped_comments])
    return scraped_comments_df
Scrape the subreddit:
""" Maria Antoniak's code with minor modifications """
def scrape_subreddit(_target_subreddits, _target_types, _years):
    '''
    Takes a list of subreddits, a list of types of content to scrape, and a list of years to scrape from
    '''
    api = PushshiftAPI()
    print('Number of PushshiftApi shards that are not working:', api.metadata_.get('shards')) # check if any Pushshift shards are down!
    for _subreddit in _target_subreddits:
        for _target_type in _target_types:
            for _year in _years:
                if _year < 2022:
                    months = [3, 4]
                    end_dates = [31, 30]
                elif _year == 2022:
                    months = [3, 4] # months to scrape
                    end_dates = [31, 30] # last day of the month
                for _month, _end_date in zip(months, end_dates):
                    _output_directory_path = os.path.join(base_path, _subreddit, _target_type) # directory to store scraped data
                    # by subreddit and type of content
                    if not os.path.exists(_output_directory_path): # if it does not exist
                        os.makedirs(_output_directory_path) # create it!
                    _file_name = _subreddit + '-' + str(_year) + '-' + str(_month) + '.pkl' # filename of the pickle with scraped data
                    # scrape only if output file does not already exist
                    if _file_name not in os.listdir(_output_directory_path):
                        print(str(datetime.now()) + ' ' + ': Scraping r/' + _subreddit + ' ' + str(_year) + '-' + str(_month) + '...')
                        if _target_type == 'posts':
                            _posts_df = scrape_posts_from_subreddit(_subreddit, api, _year, _month, _end_date)
                            if not _posts_df.empty:
                                _posts_df.to_pickle(os.path.join(_output_directory_path, _file_name), protocol=4)
                        if _target_type == 'comments':
                            _comments_df = scrape_comments_from_subreddit(_subreddit, api, _year, _month, _end_date)
                            if not _comments_df.empty:
                                _comments_df.to_pickle(os.path.join(_output_directory_path, _file_name), protocol=4)
    print(str(datetime.now()) + ' ' + ': Done scraping!')
Calling the scraping functions that Maria made:
target_subreddits = ['birthcontrol'] # subreddits to scrape
target_types = ['posts'] # type of content to scrape
years = [2018,2019,2020,2021,2022] # years to scrape
scrape_subreddit(target_subreddits, target_types, years)
Number of PushshiftApi shards that are not working: None
2022-05-16 14:45:44.138515 : Scraping r/birthcontrol 2018-3...
2022-05-16 14:45:48.948862 : Scraping r/birthcontrol 2018-4...
2022-05-16 14:45:57.547762 : Scraping r/birthcontrol 2019-3...
2022-05-16 14:46:21.991454 : Scraping r/birthcontrol 2019-4...
2022-05-16 14:46:32.819431 : Done scraping!
Combining the posts / comments, consolidating the data:
def combine_one_subreddit(_subreddit): # creating one pickle with all of a subreddit's posts
    df_d = {'author': [], 'id': [], 'type': [], 'text': [], # create a dictionary
            'url': [], 'link_id': [], 'parent_id': [],
            'subreddit': [], 'created_utc': []}
    subreddit_pkl_path = os.path.join('reddit_data', _subreddit, f'{_subreddit}.pkl') # file with all the data
    if not os.path.exists(subreddit_pkl_path): # if file does not exist
        for target_type in ['posts']:
            files_directory_path = os.path.join('reddit_data', _subreddit, target_type) # directory where scraped data is depending on subreddit and type of content
            all_target_type_files = glob.glob(os.path.join(files_directory_path, "*.pkl")) # select all appropriate pickle files
            for f in all_target_type_files: # we read each pickle file and include the info we want in the dictionary
                df = pd.read_pickle(f)
                if target_type == 'posts':
                    for index, row in df.iterrows():
                        df_d['author'].append(row['author'])
                        df_d['id'].append(f"{row['subreddit']}_{row['id']}_post") # id of the post, e.g. 'birthcontrol_812tvt_post'
                        df_d['type'].append('post')
                        df_d['text'].append(row['selftext']) # textual content of the post
                        df_d['url'].append(row['url']) # url of the post
                        df_d['link_id'].append('N/A')
                        df_d['parent_id'].append('N/A')
                        df_d['subreddit'].append(row['subreddit'])
                        df_d['created_utc'].append(row['created_utc']) # utc time stamp of the post
        subreddit_df = pd.DataFrame.from_dict(df_d) # create pandas dataframe from dictionary
        subreddit_df.sort_values('created_utc', inplace=True, ignore_index=True) # order dataframe by date of post
        subreddit_df['time'] = pd.to_datetime(subreddit_df['created_utc'], unit='s').apply(lambda x: x.to_datetime64()) # convert timestamp to date
        subreddit_df['date'] = subreddit_df['time'].apply(lambda x: str(x).split(' ')[0])
        subreddit_df['year'] = subreddit_df['time'].apply(lambda x: str(x).split('-')[0])
        subreddit_df.drop(columns=['time']) # result is not assigned, so 'time' stays in the saved dataframe
        subreddit_df.to_pickle(subreddit_pkl_path, protocol=4) # saving it to pickle format
Calling the combining function
for subreddit in target_subreddits:
    combine_one_subreddit(subreddit)
Importing some more packages
import re
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
Printing the length of the dataframe, how many posts
df = pd.read_pickle(os.path.join('reddit_data', 'birthcontrol', 'birthcontrol.pkl'))
df = df.dropna()
print(len(df))
16554
Print function modified from Maria's
def print_info(df, _type):
    if _type != 'corpus':
        vectorizer = CountVectorizer( # Token counts with stopwords
            input = 'content', # input is a string of texts
            encoding = 'utf-8',
            strip_accents = 'unicode',
            lowercase = True
        )
        texts = df['text'].astype('string').tolist()
        X = vectorizer.fit_transform(texts)
        print(f"Total vectorized words in the corpus of {_type}:", X.sum())
        print(f"Average vectorized {_type} length:", int(X.sum()/X.shape[0]), "tokens")
    else:
        vectorizer = CountVectorizer(
            input = 'content',
            encoding = 'utf-8',
            strip_accents = 'unicode',
            lowercase = True,
            stop_words = 'english' # remove stopwords
        )
        texts = df['text'].astype('string').tolist()
        X = vectorizer.fit_transform(texts)
        sum_words = X.sum(axis=0)
        words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
        words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
        word_dict = {}
        for a, b in words_freq:
            word_dict.setdefault(a, []).append(b)
        return word_dict
#https://www.geeksforgeeks.org/python-convert-list-tuples-dictionary/
Find and remove duplicate posts
import Levenshtein # needed by find_duplicates; imported here so the cells run top to bottom

def find_duplicates(_df): # function to find duplicated posts in the data
    prev_doc = ''
    map_dict = {} # dict of authors' posts
    duplicate_indexes = [] # list of duplicates' indexes for removal from dataframe
    for index, row in _df.iterrows(): # iterate over posts
        author = row['author']
        doc = row['text']
        # if author info is available we compare each post with previous ones by the same author
        # we compare/calculate the similarity between the posts using the Levenshtein distance
        if author != '[deleted]':
            if author in map_dict.keys():
                flag = 0
                idx = 0
                while idx < len(map_dict[author]) and flag == 0:
                    lev = Levenshtein.ratio(doc, map_dict[author][idx])
                    if lev > 0.99:
                        duplicate_indexes.append(index)
                        flag = 1
                    idx += 1
                if flag == 0:
                    map_dict[author].append(doc)
            else:
                map_dict[author] = [doc]
        # if author info is not available we compare each post with the preceding one chronologically
        else:
            lev = Levenshtein.ratio(row['text'], prev_doc)
            if lev > 0.90:
                duplicate_indexes.append(index)
        prev_doc = doc
    return duplicate_indexes
dupes = find_duplicates(df) # find duplicates
df.drop(dupes, inplace=True) # removing duplicates
print(f'Number of duplicates: {len(dupes)}')
Number of duplicates: 145
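To make the similarity threshold concrete, here is a small standalone sketch of near-duplicate detection. It swaps in the standard library's `difflib.SequenceMatcher` for the `Levenshtein` package, so the ratios are not the same quantity the notebook computes; `near_duplicate_indexes` and the sample texts are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stdlib stand-in for Levenshtein.ratio."""
    return SequenceMatcher(None, a, b).ratio()

def near_duplicate_indexes(docs, threshold=0.9):
    """Return indexes of docs that nearly match an earlier, kept doc."""
    kept, dupes = [], []
    for i, doc in enumerate(docs):
        if any(similarity(doc, earlier) > threshold for earlier in kept):
            dupes.append(i)   # too close to something already seen
        else:
            kept.append(doc)  # first occurrence, keep for later comparisons
    return dupes

docs = [
    "Has anyone switched from the pill to an IUD?",
    "Has anyone switched from the pill to an IUD??",
    "Does Mucinex affect NuvaRing?",
]
print(near_duplicate_indexes(docs))  # [1]
```

The second text differs from the first by one character, so its ratio is well above 0.9 and it is flagged; the third is unrelated and is kept.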
Clean data, remove bot posts and posts that have <5 different words
def cleaning_docs(raw_df, _subreddit):
    '''
    Takes the full corpus and a subreddit name. It cleans all the documents (removes punctuation and stopwords).
    It saves the clean corpus as JSON (the output file keeps a .pkl extension)
    '''
    clean_docs_file = os.path.join('reddit_data', _subreddit, f'clean_{_subreddit}.pkl')
    if not os.path.exists(clean_docs_file):
        clean_d = {'id':[], 'clean':[], 'og':[], 'year':[], 'date':[]}
        for index, row in raw_df.iterrows(): # iterating over posts
            if 'bot' not in row['author'] and 'Bot' not in row['author']: # if author is not a bot
                clean_doc_st = lmw.process_string(row['text']) # cleaning documents
                clean_doc_l = [t for t in clean_doc_st.split(' ')]
                if len(set(clean_doc_l))>5 and 'bot' not in clean_doc_l: # keep posts with more than 5 different words
                    # that do not contain the word 'bot'
                    clean_d['clean'].append(clean_doc_l)
                    clean_d['id'].append(row['id'])
                    clean_d['og'].append(row['text'])
                    clean_d['year'].append(row['year'])
                    clean_d['date'].append(row['date'])
        with open(clean_docs_file, 'w') as jsonfile: # creating a file with the dict of cleaned documents
            json.dump(clean_d, jsonfile)
import json
import little_mallet_wrapper as lmw
import Levenshtein
%%time
for subreddit in target_subreddits:
    cleaning_docs(df, subreddit)
CPU times: user 6.15 s, sys: 177 ms, total: 6.33 s
Wall time: 6.53 s
df.head() #take a look at the data
| | author | id | type | text | url | link_id | parent_id | subreddit | created_utc | time | date | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Analmarsh | birthcontrol_812tvt_post | post | Ok here’s my situation. It’s wednesday. I had ... | https://www.reddit.com/r/birthcontrol/comments... | N/A | N/A | birthcontrol | 1519880832 | 2018-03-01 05:07:12 | 2018-03-01 | 2018 |
| 1 | the_crane_wife | birthcontrol_8132ww_post | post | Hi ladies, \nSo I've had my IUD for over a yea... | https://www.reddit.com/r/birthcontrol/comments... | N/A | N/A | birthcontrol | 1519883627 | 2018-03-01 05:53:47 | 2018-03-01 | 2018 |
| 2 | danascullyvevo | birthcontrol_81350s_post | post | Hi y'all!\n\nSo question -- I have Kyleena IUD... | https://www.reddit.com/r/birthcontrol/comments... | N/A | N/A | birthcontrol | 1519884316 | 2018-03-01 06:05:16 | 2018-03-01 | 2018 |
| 3 | alyssas_888 | birthcontrol_813vd0_post | post | Did anyone else have low libido and anxiety af... | https://www.reddit.com/r/birthcontrol/comments... | N/A | N/A | birthcontrol | 1519893837 | 2018-03-01 08:43:57 | 2018-03-01 | 2018 |
| 4 | holycornchips | birthcontrol_8154qq_post | post | I've learned a great deal from this sub and it... | https://www.reddit.com/r/birthcontrol/comments... | N/A | N/A | birthcontrol | 1519909834 | 2018-03-01 13:10:34 | 2018-03-01 | 2018 |
After looking at some of the comment posts, I realized that it would be better to look only at original posts. A lot of the comments were short little phrases like "thank you" or "I'm so happy for you" where there would be no way to determine the gender of the poster. Because of this, I subset the dataframe to just posts.
posts = df[df.type == 'post']
posts1 = posts.reset_index()
posts2 = posts1.drop(columns = ['url','link_id','parent_id','created_utc','index','type','subreddit'])
posts2.head()
| | author | id | text | time | date | year |
|---|---|---|---|---|---|---|
| 0 | Analmarsh | birthcontrol_812tvt_post | Ok here’s my situation. It’s wednesday. I had ... | 2018-03-01 05:07:12 | 2018-03-01 | 2018 |
| 1 | the_crane_wife | birthcontrol_8132ww_post | Hi ladies, \nSo I've had my IUD for over a yea... | 2018-03-01 05:53:47 | 2018-03-01 | 2018 |
| 2 | danascullyvevo | birthcontrol_81350s_post | Hi y'all!\n\nSo question -- I have Kyleena IUD... | 2018-03-01 06:05:16 | 2018-03-01 | 2018 |
| 3 | alyssas_888 | birthcontrol_813vd0_post | Did anyone else have low libido and anxiety af... | 2018-03-01 08:43:57 | 2018-03-01 | 2018 |
| 4 | holycornchips | birthcontrol_8154qq_post | I've learned a great deal from this sub and it... | 2018-03-01 13:10:34 | 2018-03-01 | 2018 |
Part 2: Assigning Gold Label Gender of Poster
Count the number of "female speaker" pronouns and "male speaker" pronouns. Again, assuming heteronormativity, a female talking about her partner would use 'I','me','my','mine','his','himself','he', while a man talking about his partner would use 'she', 'her', 'hers', 'herself'. Additionally, a female in a relationship would refer to her partner as a 'boyfriend','husband','bf' while a male would refer to his partner as 'girlfriend','wife','gf'.
import spacy
nlp = spacy.load("en_core_web_lg")
def get_pronouns(input_string, nlp):
    lemmatized = nlp(input_string)
    #create pronoun lists
    f_speaker = ['I','me','my','mine','his','himself','he']
    fs_list = []
    m_speaker = ['she', 'her', 'hers', 'herself']
    ms_list = []
    #create gendered word lists
    f_words = ['boyfriend','husband','bf']
    fw = []
    m_words = ['girlfriend','wife','gf']
    mw = []
    #add token to appropriate lists
    for token in lemmatized:
        if token.pos_ == 'PRON':
            if token.lemma_ in f_speaker:
                fs_list.append(token.lemma_)
            elif token.lemma_ in m_speaker:
                ms_list.append(token.lemma_)
        else:
            if token.lemma_ in m_words:
                mw.append(token.lemma_)
            elif token.lemma_ in f_words:
                fw.append(token.lemma_)
    return fs_list,ms_list,fw,mw
#create new columns for the word list counts we made
posts2['Pronouns_F'] = 0
posts2['Pronouns_M'] = 0
posts2['Words_F'] = 0
posts2['Words_M'] = 0
posts2['Poster_Gender'] = None
posts2.head() # display the dataframe with new columns
| | author | id | text | time | date | year | Pronouns_F | Pronouns_M | Words_F | Words_M | Poster_Gender |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Analmarsh | birthcontrol_812tvt_post | Ok here’s my situation. It’s wednesday. I had ... | 2018-03-01 05:07:12 | 2018-03-01 | 2018 | 0 | 0 | 0 | 0 | None |
| 1 | the_crane_wife | birthcontrol_8132ww_post | Hi ladies, \nSo I've had my IUD for over a yea... | 2018-03-01 05:53:47 | 2018-03-01 | 2018 | 0 | 0 | 0 | 0 | None |
| 2 | danascullyvevo | birthcontrol_81350s_post | Hi y'all!\n\nSo question -- I have Kyleena IUD... | 2018-03-01 06:05:16 | 2018-03-01 | 2018 | 0 | 0 | 0 | 0 | None |
| 3 | alyssas_888 | birthcontrol_813vd0_post | Did anyone else have low libido and anxiety af... | 2018-03-01 08:43:57 | 2018-03-01 | 2018 | 0 | 0 | 0 | 0 | None |
| 4 | holycornchips | birthcontrol_8154qq_post | I've learned a great deal from this sub and it... | 2018-03-01 13:10:34 | 2018-03-01 | 2018 | 0 | 0 | 0 | 0 | None |
posts_small = posts2
#this was my tester but now I'm just using the whole thing
Count the length of the lists we made, then add the values to the appropriate columns in the df
string_df = posts1.text[1]
for x in range(len(posts_small['text'])):
    result = get_pronouns(posts2.text[x],nlp)
    # .loc writes to the dataframe directly and avoids chained-assignment warnings
    posts_small.loc[x, 'Pronouns_F'] = len(result[0])
    posts_small.loc[x, 'Pronouns_M'] = len(result[1])
    posts_small.loc[x, 'Words_F'] = len(result[2])
    posts_small.loc[x, 'Words_M'] = len(result[3])
Display the data frame with the counts of how many pronouns of each speaker type, and how many male and female relationship words were used. Although I only displayed the first 5 rows here for the sake of space, I looked at over 100 rows myself to get a sense for how many pronouns / relationship words would be enough to classify the gender.
It became obvious that if the word boyfriend or girlfriend (or more often gf or bf) appeared, the gender of the poster was easy to detect. Females talk about their boyfriends; males talk about their girlfriends. With this, I noticed that even just one use of one of these words (in combination with zero uses of the opposite-gender counterpart) was enough to comfortably classify the gender.
For the posts that did not have this clear situation, I determined the gender based on if there were more male or female speaker pronouns. I felt generally comfortable with this because most of the rows had many of one type and near none of another.
posts_small.head()
| | author | id | text | time | date | year | Pronouns_F | Pronouns_M | Words_F | Words_M | Poster_Gender |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Analmarsh | birthcontrol_812tvt_post | Ok here’s my situation. It’s wednesday. I had ... | 2018-03-01 05:07:12 | 2018-03-01 | 2018 | 8 | 0 | 0 | 0 | 0 |
| 1 | the_crane_wife | birthcontrol_8132ww_post | Hi ladies, \nSo I've had my IUD for over a yea... | 2018-03-01 05:53:47 | 2018-03-01 | 2018 | 11 | 4 | 0 | 0 | 0 |
| 2 | danascullyvevo | birthcontrol_81350s_post | Hi y'all!\n\nSo question -- I have Kyleena IUD... | 2018-03-01 06:05:16 | 2018-03-01 | 2018 | 12 | 0 | 0 | 0 | 0 |
| 3 | alyssas_888 | birthcontrol_813vd0_post | Did anyone else have low libido and anxiety af... | 2018-03-01 08:43:57 | 2018-03-01 | 2018 | 6 | 0 | 0 | 0 | 0 |
| 4 | holycornchips | birthcontrol_8154qq_post | I've learned a great deal from this sub and it... | 2018-03-01 13:10:34 | 2018-03-01 | 2018 | 7 | 0 | 0 | 0 | 0 |
Executing the assignment of gender label as outlined above. If there are any male relationship words and no female relationship words, classify as male, if the opposite is true classify as female. If neither case is true, classify based on if there are more male or female speaker pronouns. If there is no gender assigned, add it to a list so I can look at it later.
no_gender = []
for x in range(len(posts_small)):
    # relationship words decide the label when only one kind is present
    if posts_small.Words_F[x] == 0 and posts_small.Words_M[x] >0:
        posts_small.loc[x, 'Poster_Gender'] = 1
    elif posts_small.Words_M[x] == 0 and posts_small.Words_F[x] >0:
        posts_small.loc[x, 'Poster_Gender'] = 0
    # otherwise fall back to whichever pronoun category is larger
    elif posts_small.Pronouns_M[x] > posts_small.Pronouns_F[x]:
        posts_small.loc[x, 'Poster_Gender'] = 1
    elif posts_small.Pronouns_F[x] > posts_small.Pronouns_M[x]:
        posts_small.loc[x, 'Poster_Gender'] = 0
    else:
        no_gender.append(x) # no label assigned; keep the index for review
I looked through the posts that didn't get assigned a gender. Most of them had no pronouns at all, and most were asking questions. I printed a few to demonstrate. I felt comfortable leaving these unclassified and not worrying about them from here on out.
print(posts_small.text[5])
print(posts_small.text[150])
print(posts_small.text[273])
#also deleted, ads, or equal # male/female words
Has anyone ever taken plan b and MONTHS later felt insane?
If you take the simulated period pills (placebo), will you miss that simulated period if you're pregnant, just like you would miss a "real" period?
do the ingredients in mucinex affect nuvaring ?
Here is the code I ran to look through the posts, commented out because it printed so much text.
#check to see if there is something missing here that could indicate gender
#for x in no_gender:
#    print(posts_small.text[x])
#I explored this myself, but the output is long to look through
Print out the dataframe to look at classifications
posts_small.head()
| | author | id | text | time | date | year | Pronouns_F | Pronouns_M | Words_F | Words_M | Poster_Gender |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Analmarsh | birthcontrol_812tvt_post | Ok here’s my situation. It’s wednesday. I had ... | 2018-03-01 05:07:12 | 2018-03-01 | 2018 | 8 | 0 | 0 | 0 | 0 |
| 1 | the_crane_wife | birthcontrol_8132ww_post | Hi ladies, \nSo I've had my IUD for over a yea... | 2018-03-01 05:53:47 | 2018-03-01 | 2018 | 11 | 4 | 0 | 0 | 0 |
| 2 | danascullyvevo | birthcontrol_81350s_post | Hi y'all!\n\nSo question -- I have Kyleena IUD... | 2018-03-01 06:05:16 | 2018-03-01 | 2018 | 12 | 0 | 0 | 0 | 0 |
| 3 | alyssas_888 | birthcontrol_813vd0_post | Did anyone else have low libido and anxiety af... | 2018-03-01 08:43:57 | 2018-03-01 | 2018 | 6 | 0 | 0 | 0 | 0 |
| 4 | holycornchips | birthcontrol_8154qq_post | I've learned a great deal from this sub and it... | 2018-03-01 13:10:34 | 2018-03-01 | 2018 | 7 | 0 | 0 | 0 | 0 |
Subset the dataframe into two dataframes, one for male and one for female. Then print how many rows are male:
male = posts_small[posts_small.Poster_Gender == 1]
female = posts_small[posts_small.Poster_Gender == 0]
male.head(10)
print(len(male))
509
How many male rows there are
Because there were only about 500, I started going through them by hand to determine whether the classification was correct. Overall, I was really happy with the results, and felt comfortable trusting these labels. I did find a few errors that were interesting to note, printed below:
male.text[118]
'I have an IUD and I’m worried it might be displaced. I have been feeling pain and depression for a while now but I have the IUD for a uterus illness so that is likely why that exists, it just wasn’t as bad until after I got the IUD put in. \n\nI am aware that more muscle contractions means more likelihood of displacement. I have serious pelvic muscle dysfunction and I’m not sure if that could contribute. I tried to reach in and feel the strings and I can’t feel them. My girlfriend is trans, but she hasn’t been taking her hormones consistently for the past few months, she last took her meds two days ago. \n\nYesterday I tried penetration for the first time in two months now. She came pretty quickly, but it took only a few ins-and-outs only halfway (I couldn’t handle more) and she pulled put well before she came. I hovered over at a distance. None of it touched me unless there was potential precum when she penetrated me. \n\nAny chance I could be pregnant or my IUD could be dislodged? Sorry if I seem paranoid here, due to said health issues this is a huge deal for me'
This was really interesting to me, and called into question some of my cis-gendered, heteronormative assumptions. I assumed that only heterosexual couples need to worry about birth control, but I didn't consider trans individuals. I still feel comfortable that the majority of posts fit into these general structures, but this was an important reminder of the giant asterisk on this project: it makes heteronormative assumptions.
male.text[274]
"My friend is quite worried, as her Skyle IUD has expired 2 months ago. However, last week she had unprotected sex, and the guy ejaculated a little bit inside of her before completely pulling out/finishing. \n\nThis was also day 4 or 5 of her period. There wasn't too much blood she recalls.\n\n1. Does skyla continue to work after its claimed expiration date, like Mirena does? I understand there may not be enough research on this yet. \n2. Does having an IUD physically present in the uterus, (expired or not), already have a large effect on blocking sperm?\n\nAny insight or ideas would be highly appreciated here, as she is quite stressed/worried at the moment. \n\nThank you in advance. \n\n\n"
Here's another error, but this one seems like something that might come up more frequently. I assumed that if someone was posting in the third person, it would be for their significant other. But what about friends posting for other friends? Because of this, I looked for "friend" in all the male posts. Again, I left the code in to demonstrate what I did, but commented out the print statements because they printed so much.
count = 0
for x in male.index:
    if ' friend ' in male.text[x]:
        # used spaces around friend to avoid flagging girl/boyfriend
        count += 1
        #print(x)
        #print(male.text[x])
# commented these out because the output was long; there were 14 "friend" results that I went through, dropping
# 215, 274, 2137, 2612, 3693, 8713
I ended up finding 14 posts that contained " friend ", and in 6 of those the friend was the person involved in the actual events of the post. Post 215 is a good example: it asks about a friend's IUD situation, with no indication that the poster is male. The posts that I did not remove involved situations where the friend was a female partner.
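One limitation of the space-padded check is that it misses "friend" when it is followed by punctuation ("friend," or "friend."). A word-boundary regular expression is one possible alternative. The sketch below is only an illustration: the `posts` dictionary is made-up toy data, not rows from the dataset, and `friend_pattern` is a name introduced here.

```python
import re

# Hypothetical toy posts standing in for male.text; not taken from the dataset.
posts = {
    0: "My friend is worried about her IUD.",
    1: "My girlfriend missed a pill yesterday.",
    2: "Asking for a friend, does Skyla expire?",
    3: "My boyfriend and I use condoms.",
}

# \b marks a word boundary, so "girlfriend" and "boyfriend" do not match,
# while "friend" followed by a space or punctuation still does.
friend_pattern = re.compile(r"\bfriends?\b", flags=re.IGNORECASE)

flagged = [idx for idx, text in posts.items() if friend_pattern.search(text)]
print(flagged)  # [0, 2]
```

On the real dataframe, the same pattern could presumably be applied with `male.text.str.contains(friend_pattern)`, though any extra matches would still need to be read by hand.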
male.head()
| | author | id | text | time | date | year | Pronouns_F | Pronouns_M | Words_F | Words_M | Poster_Gender |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | Anonthealex | birthcontrol_817j1w_post | I had sex with my now ex girlfriend on the 5th... | 2018-03-01 18:35:47 | 2018-03-01 | 2018 | 13 | 6 | 0 | 1 | 1 |
| 34 | ttty23 | birthcontrol_81eesc_post | Hey guys so my gf has been on the pill for 5 m... | 2018-03-02 14:05:17 | 2018-03-02 | 2018 | 4 | 19 | 0 | 2 | 1 |
| 215 | alexxiskayla | birthcontrol_83arbl_post | My friend has been on birth control for a year... | 2018-03-09 22:29:35 | 2018-03-09 | 2018 | 3 | 9 | 0 | 0 | 1 |
| 234 | HoldenMegroin87 | birthcontrol_83eu92_post | So im looming for some info and insight for my... | 2018-03-10 12:18:31 | 2018-03-10 | 2018 | 4 | 5 | 0 | 1 | 1 |
| 258 | Bestojojo | birthcontrol_83maky_post | My gf and I had unproptected sex in the first ... | 2018-03-11 11:52:14 | 2018-03-11 | 2018 | 3 | 4 | 0 | 1 | 1 |
Remove the "friend" posts:
male1 = male.drop([215,274,2137,2612,3693,8713])
Sample 503 rows from the female data, so that we can reasonably compare male and female posts. Otherwise, we would have a very unbalanced dataset.
female1 = female.sample(n=503)
Create two datasets to work with, so that I don't have to re-run all the code every time I work on this
male1.to_csv('male.csv')
female1.to_csv('female.csv')
The two datasets have been created!!
Part 3: Remove Pronouns and Relationship Words from the Texts & Create a Matrix of Word Embeddings¶
Merge the two dataframes back together so we don't have to do everything twice:
frames = [male1,female1]
final = pd.concat(frames)
Re-index (index values were strange because data came from two dataframes)
final = final.reset_index()
final.drop(columns = ['index']) # only displays the result; `final` itself keeps the old 'index' column
| | author | id | text | time | date | year | Pronouns_F | Pronouns_M | Words_F | Words_M | Poster_Gender |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Anonthealex | birthcontrol_817j1w_post | I had sex with my now ex girlfriend on the 5th... | 2018-03-01 18:35:47 | 2018-03-01 | 2018 | 13 | 6 | 0 | 1 | 1 |
| 1 | ttty23 | birthcontrol_81eesc_post | Hey guys so my gf has been on the pill for 5 m... | 2018-03-02 14:05:17 | 2018-03-02 | 2018 | 4 | 19 | 0 | 2 | 1 |
| 2 | HoldenMegroin87 | birthcontrol_83eu92_post | So im looming for some info and insight for my... | 2018-03-10 12:18:31 | 2018-03-10 | 2018 | 4 | 5 | 0 | 1 | 1 |
| 3 | Bestojojo | birthcontrol_83maky_post | My gf and I had unproptected sex in the first ... | 2018-03-11 11:52:14 | 2018-03-11 | 2018 | 3 | 4 | 0 | 1 | 1 |
| 4 | PM_ELEPHANTS | birthcontrol_83mhpz_post | So, yesterday I lost my virginity with my girl... | 2018-03-11 12:43:04 | 2018-03-11 | 2018 | 8 | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1001 | devinhughesxdx | birthcontrol_fhrnw9_post | Can I use spermicide with my Paragard copper i... | 2020-03-13 01:29:42 | 2020-03-13 | 2020 | 2 | 0 | 0 | 0 | 0 |
| 1002 | SnooMachines1497 | birthcontrol_m0xs6p_post | I skipped my placebo pills and went straight t... | 2021-03-09 03:33:52 | 2021-03-09 | 2021 | 4 | 0 | 0 | 0 | 0 |
| 1003 | nylc1106 | birthcontrol_thlkl7_post | So I got my IUD 3/4/22 and I expected the pain... | 2022-03-19 02:32:27 | 2022-03-19 | 2022 | 10 | 0 | 0 | 0 | 0 |
| 1004 | useawishrightnow | birthcontrol_m58rj1_post | Okay so i don't really hate Mirena but i think... | 2021-03-15 00:48:25 | 2021-03-15 | 2021 | 8 | 0 | 0 | 0 | 0 |
| 1005 | tcam4life | birthcontrol_tebbm4_post | I’m having weird mucus like discharge very lon... | 2022-03-14 23:50:33 | 2022-03-14 | 2022 | 11 | 0 | 0 | 0 | 0 |
1006 rows × 11 columns
Save the dataframe, so I can submit it with the project
final.to_csv('final.csv')
Remove the pronouns and relationship words! I know that the pronouns and relationship words are super indicative of poster gender, but that's really uninteresting. The purpose of the project is to go beyond this and explore whether we can classify the gender of the poster without these. The remove_pronouns function parses each post with spaCy, then returns an embedding vector averaged over all tokens that are not stop words, pronouns, or relationship words. It also returns the filtered list of tokens that I will use later.
from nltk.corpus import stopwords
stops = stopwords.words('english')
Modified from get_doc_embeddings function in HW 8
def remove_pronouns(input_string, nlp):
    doc = nlp(input_string)  # parse the post with spaCy
    # relationship words to strip out, matched against each token's lemma
    gender_words = ['boyfriend','husband','bf','girlfriend','wife','gf']
    words = []
    for token in doc:
        # compare the token's text (a string) against the NLTK stop-word list
        if token.pos_ != 'PRON' and token.lemma_ not in gender_words and token.pos_ != 'PUNCT' and token.text not in stops:
            words.append(token)
    # keep only tokens that have a vector and are not spaCy stop words, punctuation, or whitespace
    culled = [token for token in words if not (token.is_stop or token.is_punct or token.is_space) and token.has_vector]
    embed_vector = np.mean([token.vector for token in culled], axis=0)  # average the remaining word vectors
    return embed_vector, words
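To make the averaging step concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `toy_vectors` holds made-up 3-dimensional vectors (real spaCy vectors are much longer), and `excluded` is a stand-in for the pronoun and relationship-word filters.

```python
# Hypothetical 3-dimensional toy vectors; not real spaCy embeddings.
toy_vectors = {
    "pill": [1.0, 0.0, 2.0],
    "missed": [3.0, 2.0, 0.0],
    "worried": [2.0, 4.0, 1.0],
}
# Stand-ins for the pronouns and relationship words that get removed.
excluded = {"my", "she", "girlfriend", "gf"}

def average_embedding(tokens, vectors, excluded):
    """Average the vectors of tokens that are not excluded and have a vector."""
    kept = [vectors[t] for t in tokens if t not in excluded and t in vectors]
    if not kept:
        return None  # nothing left to average
    dims = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dims)]

print(average_embedding(["my", "girlfriend", "missed", "pill"], toy_vectors, excluded))
# [2.0, 1.0, 1.0]
```

The post is reduced to "missed" and "pill", and the result is the element-wise mean of their two vectors. Returning `None` for an empty post is a design choice of this sketch; the function above would instead produce a NaN value in that case.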
Create a matrix of the embedded vectors to be used for regression
matrix = []
for x in final.text:
    value = remove_pronouns(x,nlp)[0] # call remove_pronouns on each of the post texts
    matrix.append(value)
matrix = np.array(matrix) # turn into a np.array
scaled = StandardScaler().fit_transform(matrix) # standard scale
Create another column, for later use, that has all the lemmatized words in a list
# build the column in one pass; each cell holds the list of kept tokens for that post
final['tokens'] = [remove_pronouns(text, nlp)[1] for text in final.text]
The final dataframe!! With the new column that has the lemmatized tokens (to be used later)
final.head()
| | index | author | id | text | time | date | year | Pronouns_F | Pronouns_M | Words_F | Words_M | Poster_Gender | tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | Anonthealex | birthcontrol_817j1w_post | I had sex with my now ex girlfriend on the 5th... | 2018-03-01 18:35:47 | 2018-03-01 | 2018 | 13 | 6 | 0 | 1 | 1 | [sex, ex, 5th, January, , Worried, previous, ... |
| 1 | 34 | ttty23 | birthcontrol_81eesc_post | Hey guys so my gf has been on the pill for 5 m... | 2018-03-02 14:05:17 | 2018-03-02 | 2018 | 4 | 19 | 0 | 2 | 1 | [Hey, guys, pill, 5, months, recently, missed,... |
| 2 | 234 | HoldenMegroin87 | birthcontrol_83eu92_post | So im looming for some info and insight for my... | 2018-03-10 12:18:31 | 2018-03-10 | 2018 | 4 | 5 | 0 | 1 | 1 | [So, looming, info, insight, 25f, So, around, ... |
| 3 | 258 | Bestojojo | birthcontrol_83maky_post | My gf and I had unproptected sex in the first ... | 2018-03-11 11:52:14 | 2018-03-11 | 2018 | 3 | 4 | 0 | 1 | 1 | [unproptected, sex, first, week, january, took... |
| 4 | 259 | PM_ELEPHANTS | birthcontrol_83mhpz_post | So, yesterday I lost my virginity with my girl... | 2018-03-11 12:43:04 | 2018-03-11 | 2018 | 8 | 1 | 0 | 1 | 1 | [So, yesterday, lost, virginity, two, years, w... |
Part 4: The Big Moment!! Logistic Regression!¶
I will use the embedded_vector matrix (based on the text without pronouns or relationship words) to predict the gender of poster. I used the sklearn model and increased max_iter to 10000 because it was not converging.
from sklearn.model_selection import cross_val_score
y = final.Poster_Gender.astype('int')
cross_score = np.mean(cross_val_score(LogisticRegression(max_iter = 10000), scaled,y , scoring='accuracy', cv=5))
print("The proper cross validated score: " + str(cross_score))
The proper cross validated score: 0.8359588197625731
I tried to use feature selection to improve the model, although it decreased the accuracy. It might be worthwhile to further explore using fewer features (and avoiding over-fitting the model to this specific dataset) in the future, but for now I am focusing on accuracy.
from sklearn.feature_selection import SelectKBest, mutual_info_classif
final_features2 = SelectKBest(score_func=mutual_info_classif, k=50)
best2 = final_features2.fit_transform(scaled,y)
cross_score2 = np.mean(cross_val_score(LogisticRegression(), best2, y, scoring='accuracy', cv=10))
print(cross_score2)
0.8131584158415842
The accuracy score using k= 50 features
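One caveat about the cell above: the scaler and the feature selector are fit on all rows before cross-validation, so the held-out folds influence which features are kept. A possible way to avoid this is to put every step inside a scikit-learn pipeline, so each fold fits its own scaler and selector. The sketch below uses synthetic data from `make_classification` as a stand-in for the embedding matrix; `X_demo`, `y_demo`, and `pipe` are names introduced here, and the numbers it prints say nothing about the project's data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embedding matrix and gender labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0
)

# Scaling and feature selection are refit on the training part of every fold.
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=mutual_info_classif, k=10),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(pipe, X_demo, y_demo, scoring="accuracy", cv=5)
print(scores.mean())
```

With the project's data, the unscaled `matrix` and `y` would take the place of `X_demo` and `y_demo`.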
In order to improve the logistic regression, I used Gridsearch to find the best parameters. The code is taken from lecture and my code in HW 8.
%%time
# Grid search: wide vs. deep, and compare solvers
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, train_test_split
import warnings
params = {
    'multi_class': ['auto', 'ovr', 'multinomial'],
    'penalty':['none', 'l2','l1','elasticnet'],
    'solver':['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'],
    'max_iter':[2000] # not part of the search, but sets a classifier parameter (overrides the value passed below)
}
clf = GridSearchCV(LogisticRegression(max_iter = 1000), params, n_jobs=-1)
# work with a subset of the data, to speed things up
X_train, X_test, y_train, y_test = train_test_split(scaled, y, train_size=0.8)
# perform grid search; unsupported solver/penalty combinations fail to fit and are scored as NaN by default
with warnings.catch_warnings() as w:
    warnings.simplefilter("ignore")
    clf.fit(X_train, y_train) # Note subset of the data!
CPU times: user 2.4 s, sys: 102 ms, total: 2.5 s Wall time: 1min 29s
clf.best_params_
{'max_iter': 2000, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'saga'}
The best parameters for logistic regression, as determined by the gridsearch
cross_score2 = np.mean(cross_val_score(LogisticRegression(max_iter = 2000, multi_class = 'auto', penalty = 'l2', solver = 'saga'), scaled, y, scoring='accuracy', cv=10))
print("The cross validated score: " + str(cross_score2))
The cross validated score: 0.841990099009901
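This improved the accuracy!! Yay! The gain is small (0.836 to 0.842), and the earlier score used cv=5 while this one uses cv=10, so the two numbers are not strictly comparable.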
This is the conclusion of the work I set out to do. I am so happy that I was able to follow through on my plan, and utilize the tools we learned in class with real data. The following section is some more basic code that I was discouraged from structuring my project around, but that I was curious about and want to explore a little.
Part 5: Bonus / Fun Little Explorations¶
A. Word Clouds¶
Use the modified print_info function to get the entire word/frequency dict for both male and female posts.
female_words = print_info(female1,'corpus')
male_words = print_info(male1, 'corpus')
Imports for a word cloud!!
import matplotlib.pyplot as plt
import matplotlib as mpl
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
Convert the dictionary to the format the word cloud expects (keep only the first element of each entry)
for x in female_words:
    female_words[x] = female_words[x][0]
wordcloud = WordCloud(min_word_length =3,
background_color='white', colormap = 'Reds_r')
# generate the word cloud
wordcloud.generate_from_frequencies(female_words)
#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Female WordCloud
for x in male_words:
    male_words[x] = male_words[x][0]
wordcloud = WordCloud(min_word_length =3,
background_color='white',colormap = 'Blues')
# generate the word cloud
wordcloud.generate_from_frequencies(male_words)
#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Male WordCloud
The size of the word "sex" in the male vs female cloud really stands out here. It even took me a while to find it in the female cloud.
B. T-Test Word Lists¶
from scipy.stats import ttest_rel
common_words = set.intersection(set(male_words.keys()), set(female_words.keys()))
t,p = ttest_rel( [female_words[k] for k in common_words],[male_words[k] for k in common_words])
print("The t-statistic is: " + str(t)+ "\nThe p-value is: " + str(p))
#https://stackoverflow.com/questions/16892486/python-program-to-perform-t-test-on-frequency-list
The t-statistic is: 6.154906342348895
The p-value is: 8.894587120590231e-10
The difference is statistically significant, although I should note that the t-test assumes a normal distribution and I do not know if my data meets this assumption.
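For reference, the paired t-statistic that `ttest_rel` reports is the mean of the per-word differences divided by its standard error. The sketch below works through that arithmetic with the standard library only; the two count lists are made-up numbers, not the project's word frequencies.

```python
import math
import statistics

# Hypothetical counts for five shared words (not the project's data).
female_counts = [12, 30, 7, 18, 25]
male_counts = [10, 22, 9, 11, 20]

# A paired test works on the per-word differences.
diffs = [f - m for f, m in zip(female_counts, male_counts)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)

# t = mean difference / standard error, with n - 1 degrees of freedom
t_stat = mean_diff / (sd_diff / math.sqrt(n))
print(round(t_stat, 3))
```

Because word counts tend to be heavily skewed, a rank-based test such as `scipy.stats.wilcoxon` might be a reasonable check on the normality concern noted above; that is a suggestion, not something this project ran.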
C. Sentiment Scoring¶
Read emolex function taken from lecture
# A freebie helper function to read and parse the emolex file
import os
from collections import defaultdict
emolex_file = 'emolex.txt'
def read_emolex(filepath=None):
    '''
    Takes a file path to the emolex lexicon file.
    Returns a dictionary of emolex sentiment values.
    '''
    if filepath is None: # Try to find the emolex file
        filepath = os.path.join('..','..','data','lexicons','emolex.txt')
        if os.path.isfile(filepath):
            pass
        elif os.path.isfile('emolex.txt'):
            filepath = 'emolex.txt'
        else:
            raise FileNotFoundError('No EmoLex file found')
    emolex = defaultdict(dict) # Like Counter(), defaultdict eases dictionary creation
    with open(filepath, 'r') as f:
        # emolex file format is: word emotion value
        for line in f:
            word, emotion, value = line.strip().split()
            emolex[word][emotion] = int(value)
    return emolex
# Get EmoLex data. Make sure you set the right file path above.
emolex = read_emolex(emolex_file)
Sentiment score function (modified from my HW):
def sentiment_score(tokens, lexicon = emolex):
    # make the scoring dictionary, one counter per emotion ('aback' is just a word that is in the lexicon)
    sentiments = dict.fromkeys(lexicon['aback'].keys(), 0)
    sent_len = len(tokens)
    for word in tokens:
        # .get avoids adding unknown words to the defaultdict lexicon
        for emotion, value in lexicon.get(word, {}).items():
            sentiments[emotion] += value
    # Normalize by the number of tokens
    for k in sentiments:
        sentiments[k] = round(sentiments[k]/sent_len, 3)
    return sentiments
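A small worked example may help show what the normalized scores mean. The sketch below uses a made-up two-emotion lexicon in the same word → emotion → 0/1 shape as EmoLex; `toy_lexicon` and `toy_sentiment_score` are names introduced here, not part of the project code.

```python
from collections import defaultdict

# Hypothetical two-emotion lexicon shaped like EmoLex (word -> emotion -> 0/1).
toy_lexicon = defaultdict(dict)
toy_lexicon["pain"] = {"fear": 1, "joy": 0}
toy_lexicon["love"] = {"fear": 0, "joy": 1}

def toy_sentiment_score(tokens, lexicon):
    """Return per-emotion counts divided by the number of tokens."""
    totals = {"fear": 0, "joy": 0}
    if not tokens:
        return totals  # avoid dividing by zero on an empty post
    for word in tokens:
        # .get does not create entries for words missing from the lexicon
        for emotion, value in lexicon.get(word, {}).items():
            totals[emotion] += value
    return {emotion: round(total / len(tokens), 3) for emotion, total in totals.items()}

print(toy_sentiment_score(["pain", "love", "pain", "pill"], toy_lexicon))
# {'fear': 0.5, 'joy': 0.25}
```

Two of the four tokens carry "fear" and one carries "joy", so the scores are 0.5 and 0.25. The word "pill" is not in the lexicon and only contributes to the token count.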
Change the elements in the list to strings
# convert each spaCy token to a plain string so it can be looked up in the lexicon
final['tokens'] = [[str(token) for token in tokens] for tokens in final.tokens]
Create Dataframes for Male and Female Poster Emotions, then fill them with sentiment scores
female_emotions = pd.DataFrame(columns=['gender','anger', 'anticipation','disgust','fear', 'joy', 'negative','positive','sadness', 'surprise','trust'])
male_emotions = pd.DataFrame(columns=['gender','anger', 'anticipation','disgust','fear', 'joy', 'negative','positive','sadness', 'surprise','trust'])
for x in final.index:
    if final.Poster_Gender[x] == 1:
        sent = sentiment_score(final.tokens[x])
        male_emotions.loc[x] = final.Poster_Gender[x],sent['anger'],sent['anticipation'],sent['disgust'],sent['fear'],sent['joy'],sent['negative'],sent['positive'],sent['sadness'],sent['surprise'],sent['trust']
    else:
        sent = sentiment_score(final.tokens[x])
        female_emotions.loc[x] = final.Poster_Gender[x],sent['anger'],sent['anticipation'],sent['disgust'],sent['fear'],sent['joy'],sent['negative'],sent['positive'],sent['sadness'],sent['surprise'],sent['trust']
male_emotions.head()
| | gender | anger | anticipation | disgust | fear | joy | negative | positive | sadness | surprise | trust |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.000 | 0.072 | 0.029 | 0.029 | 0.043 | 0.072 | 0.043 | 0.029 | 0.014 | 0.072 |
| 1 | 1.0 | 0.008 | 0.070 | 0.000 | 0.039 | 0.031 | 0.085 | 0.085 | 0.047 | 0.008 | 0.062 |
| 2 | 1.0 | 0.000 | 0.054 | 0.000 | 0.036 | 0.036 | 0.036 | 0.036 | 0.018 | 0.000 | 0.036 |
| 3 | 1.0 | 0.000 | 0.111 | 0.000 | 0.000 | 0.056 | 0.083 | 0.111 | 0.056 | 0.000 | 0.111 |
| 4 | 1.0 | 0.016 | 0.031 | 0.000 | 0.047 | 0.000 | 0.062 | 0.062 | 0.047 | 0.031 | 0.062 |
Find the mean of each emotion for each male and female set, then print the means
m_emo = {}
f_emo = {}
for col in male_emotions.columns:
    m_emo[col] = male_emotions[col].mean()
for col in female_emotions.columns:
    f_emo[col] = female_emotions[col].mean()
print(m_emo)
print(f_emo)
{'gender': 1.0, 'anger': 0.010556660039761422, 'anticipation': 0.06117296222664019, 'disgust': 0.017646123260437352, 'fear': 0.035707753479125234, 'joy': 0.03728031809145126, 'negative': 0.05875149105367803, 'positive': 0.07716103379721675, 'sadness': 0.032047713717693826, 'surprise': 0.011101391650099393, 'trust': 0.06385288270377744}
{'gender': 0.0, 'anger': 0.015487077534791241, 'anticipation': 0.052982107355864845, 'disgust': 0.02133598409542741, 'fear': 0.03894234592445329, 'joy': 0.030206759443339937, 'negative': 0.05649105367793238, 'positive': 0.07070775347912524, 'sadness': 0.03418290258449304, 'surprise': 0.012723658051689844, 'trust': 0.05784294234592448}
Sentiment scores that are higher in male posts: 'anticipation', 'joy', 'negative', 'positive', 'trust'
Sentiment scores that are higher in female posts: 'anger', 'disgust', 'fear', 'sadness', 'surprise'
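The two lists above can be reproduced from the printed means with a short comparison. In this sketch the dictionaries hold the means from the output above rounded to four decimals, and the variable names are introduced here for illustration.

```python
# Mean scores copied from the printed output above, rounded to four decimals.
m_means = {"anger": 0.0106, "anticipation": 0.0612, "disgust": 0.0176, "fear": 0.0357,
           "joy": 0.0373, "negative": 0.0588, "positive": 0.0772, "sadness": 0.0320,
           "surprise": 0.0111, "trust": 0.0639}
f_means = {"anger": 0.0155, "anticipation": 0.0530, "disgust": 0.0213, "fear": 0.0389,
           "joy": 0.0302, "negative": 0.0565, "positive": 0.0707, "sadness": 0.0342,
           "surprise": 0.0127, "trust": 0.0578}

# Split the emotions by which group has the higher mean score.
higher_male = sorted(e for e in m_means if m_means[e] > f_means[e])
higher_female = sorted(e for e in m_means if m_means[e] < f_means[e])

# Size of each gap (male minus female), largest absolute gap first.
gaps = sorted(((e, round(m_means[e] - f_means[e], 4)) for e in m_means),
              key=lambda pair: abs(pair[1]), reverse=True)

print(higher_male)    # ['anticipation', 'joy', 'negative', 'positive', 'trust']
print(higher_female)  # ['anger', 'disgust', 'fear', 'sadness', 'surprise']
print(gaps[0])        # ('anticipation', 0.0082)
```

The gaps are small, well under 0.01 per token, and this sketch does not test whether any of them is statistically significant.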
Discussion & Conclusion:¶
Limitations¶
This study was limited by assumptions, scale, and scope.
The major assumption that all couples were cis-gendered and heteronormative disregarded communities that may encounter birth control struggles in different ways. The example from male.text[118] exemplifies this issue, as the poster was a female who was concerned that she was pregnant by her girlfriend, who is trans. Though cases like these may seem rare, research that quickly disregards them perpetuates inequality, especially when marginalized groups are involved. If I were to publish research of this sort, I would be sure to further explore how to accurately represent the experiences of all identities and relationship types.
The scale of the project was also limited. Though the 16,554 posts I started with seemed like a more than large enough dataset, finding only 503 male-authored posts (and then sampling 503 female posts to match) left a relatively small dataset. In the future, I would scrape a larger collection to begin with (by scraping more years), so that once I narrowed to only male posts, I would still have a larger amount of data. This issue also could have been addressed by keeping the larger number of female posts, and then working with models that can account for unbalanced datasets, like BalancedRandomForestClassifier.
The scope of the project was also limited. In an ideal world, I would have been able to dig deeper into the features and determine not only that the classifier could distinguish between male and female posts, but also why. The small experiments at the end provide some insight into this: men and women use different words with different sentiments. This would be an interesting direction to further explore. It would also have been interesting to dig deeper into the errors, as some of them are likely errors in gold-labeling rather than prediction. Exploring this would help in understanding the shortcomings of the Logistic Regression model, or could even elucidate some factors that I disregarded in the gold labeling.
Discussion¶
The project was overall successful. From an embedded_vector matrix, I was able to predict the gender of the poster with 84% accuracy. This indicates that males on r/birthcontrol do post in a markedly different way than females, even aside from clearly indicative factors like pronouns or relationship words. There are a few reasonable explanations for this finding. One is that men, in general, make different language choices than women regardless of the topic. In this course, we have explored differences between male and female authors, so broader explanations of word choice and style by gender as a whole may come into play here. The second explanation is that the situations that drive men to post on r/birthcontrol might overall be different from the situations in which females post. For example, a female might post about cramping, acne, or other side effects, while these events are not happening to males and they might be less attuned to them. Even if a male was concerned about, say, his girlfriend’s acne, posting about it on reddit would likely feel like overstepping. With this, more of the male posts are likely to relate to an event that directly involves the male, making male posts potentially more focused on pregnancy scares or urgent advice. Likely, the actual predictive accuracy relies upon a combination of these two factors, with both the language style of the gender and the role of the man in a birth control discussion playing a part.
From the word cloud, it is obvious that men use the word “sex” much more frequently than women do. They also use the word “days” more frequently, while women use “months” more frequently. Words associated with symptoms of birth control like “spotting” and “cramps” are present in the female word cloud, but not the male. An interesting explanation for these findings would be that birth control is a concern to men only in the days preceding or following sex, while the concern is more constant and ongoing for women, involving discussion of relevant symptoms and experiences outside of sex itself.
The sentiment scoring also provided insight into the differences in the text of the male-authored vs. female-authored posts. The scores that were higher for men than women were 'anticipation', 'joy', 'negative', 'positive', and 'trust', while the scores that were higher for women were 'anger', 'disgust', 'fear', 'sadness', and 'surprise'. For the male posts, we already noted that males use the word sex more, which is associated with the “joy” and “anticipation” sentiments. Because women are the ones actually experiencing side effects from almost all types of birth control, I was not surprised that the sadness/anger/disgust words were more associated with female posts. Another factor that may play into this is that females are more likely to post about negative concerns other than pregnancy scares, like side effects or implantation / insertion experiences of long-acting reversible birth controls like IUDs or implants. These experiences would be less likely to include words with positive sentiments like sex, and more likely to include words describing pain. In discussions of birth control, men write in a way that is overall more positive, while females, who actually experience both pregnancies and birth control effects, post with more fear/disgust/sadness language.
Conclusion¶
Although the number of posts on r/birthcontrol is not necessarily an ideal proxy for how involved males are in discussions of birth control, it's a little disheartening how few of the posts are from men (3%). All in all, the differences between male- and female-authored posts on reddit were apparent, and using text mining was an enlightening way to extract information from and compare hundreds of posts. I appreciate the opportunity to explore a topic of interest to me and apply the methods learned in class this semester to this project. Thank you, and have a great summer!