import sys
!{sys.executable} -m pip install psaw little_mallet_wrapper Levenshtein

Collecting psaw
  Downloading psaw-0.1.0-py3-none-any.whl (15 kB)
Collecting little_mallet_wrapper
  Downloading little_mallet_wrapper-0.5.0-py3-none-any.whl (19 kB)
Collecting Levenshtein
  Downloading Levenshtein-0.18.1-cp39-cp39-macosx_10_9_x86_64.whl (242 kB)
Requirement already satisfied: Click in /opt/anaconda3/envs/3350/lib/python3.9/site-packages (from psaw) (8.0.3)
Requirement already satisfied: requests in /opt/anaconda3/envs/3350/lib/python3.9/site-packages (from psaw) (2.27.1)
Collecting rapidfuzz<3.0.0,>=2.0.1
  Downloading rapidfuzz-2.0.11-cp39-cp39-macosx_10_9_x86_64.whl (1.6 MB)
Collecting jarowinkler<1.1.0,>=1.0.2
  Downloading jarowinkler-1.0.2-cp39-cp39-macosx_10_9_x86_64.whl (72 kB)
Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/envs/3350/lib/python3.9/site-packages (from requests->psaw) (2021.10.8)
Requirement already satisfied: idna<4,>=2.5 in /opt/anaconda3/envs/3350/lib/python3.9/site-packages (from requests->psaw) (3.3)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/anaconda3/envs/3350/lib/python3.9/site-packages (from requests->psaw) (1.26.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /opt/anaconda3/envs/3350/lib/python3.9/site-packages (from requests->psaw) (2.0.4)
Installing collected packages: jarowinkler, rapidfuzz, psaw, little-mallet-wrapper, Levenshtein
Successfully installed Levenshtein-0.18.1 jarowinkler-1.0.2 little-mallet-wrapper-0.5.0 psaw-0.1.0 rapidfuzz-2.0.11

!{sys.executable} -m pip install tomotopy

Collecting tomotopy
  Downloading tomotopy-0.12.2-cp39-cp39-macosx_10_14_x86_64.whl (14.5 MB)
Requirement already satisfied: numpy>=1.11.0 in /opt/anaconda3/envs/3350/lib/python3.9/site-packages (from tomotopy) (1.21.2)
Installing collected packages: tomotopy
Successfully installed tomotopy-0.12.2

#importing necessary packages
from datetime import datetime
import os
import glob
import pandas as pd
from psaw import PushshiftAPI

base_path = os.path.join('reddit_data')  # creating a directory for the data
if not os.path.exists(base_path):  # if it does not exist
    os.makedirs(base_path)         # create it

""" Maria Antoniak's code with minor modifications """
def scrape_posts_from_subreddit(subreddit, api, year, month, end_date):
    '''
    Takes the name of a subreddit, the PushshiftAPI client, a year, a month, and the day of that month on which the scraping window ends.
    Returns a dataframe of the scraped posts.
    '''
    start_epoch = int(datetime(year, month, 1).timestamp())  # convert date into a Unix timestamp
    end_epoch = int(datetime(year, month, end_date).timestamp())

    gen = api.search_submissions(after=start_epoch,
                                 before=end_epoch,
                                 subreddit=subreddit,
                                 filter=['url', 'author', 'created_utc',  # info we want about the post
                                         'title', 'subreddit', 'selftext',
                                         'num_comments', 'score', 'link_flair_text', 'id'])

    max_response_cache = 100000
    scraped_posts = []
    for _post in gen:
        scraped_posts.append(_post)
        if len(scraped_posts) >= max_response_cache:  # avoid requesting more posts than allowed
            break

    scraped_posts_df = pd.DataFrame([p.d_ for p in scraped_posts])

    return scraped_posts_df
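
The scraping window above is built from naive `datetime` objects, so the epochs depend on the machine's local timezone, and `datetime(year, month, end_date)` falls at midnight at the start of the last day. As a minimal sketch of an alternative, assuming a whole-month UTC window is what is wanted, the hypothetical helper `month_window` (not part of the original notebook) computes both epochs from the first day of the month and the first day of the following month:

```python
from datetime import datetime, timezone

def month_window(year, month):
    """Return (start_epoch, end_epoch) covering one whole calendar month in UTC."""
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    # The window ends at the first instant of the following month;
    # December rolls over to January of the next year.
    if month == 12:
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    return int(start.timestamp()), int(end.timestamp())

start_epoch, end_epoch = month_window(2018, 3)
print(start_epoch, end_epoch)  # 1519862400 1522540800
```

These two values could be passed as `after` and `before` in place of the ones computed inside the scraping functions.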

""" Maria Antoniak's code with minor modifications """
def scrape_comments_from_subreddit(subreddit, api, year, month, end_date):
    '''
    Takes the name of a subreddit, the PushshiftAPI client, a year, a month, and the day of that month on which the scraping window ends.
    Returns a dataframe of the scraped comments.
    '''
    start_epoch = int(datetime(year, month, 1).timestamp())  # convert date into a Unix timestamp
    end_epoch = int(datetime(year, month, end_date).timestamp())

    gen = api.search_comments(after=start_epoch,
                              before=end_epoch,
                              subreddit=subreddit,
                              filter=['author', 'body', 'created_utc', # info we want about the comment
                                      'id', 'link_id', 'parent_id',
                                      'reply_delay', 'score', 'subreddit'])

    max_response_cache = 100000
    scraped_comments = []
    for _comment in gen:
        scraped_comments.append(_comment)
        if len(scraped_comments) >= max_response_cache:  # avoid requesting more comments than allowed
            break
    scraped_comments_df = pd.DataFrame([p.d_ for p in scraped_comments])

    return scraped_comments_df

""" Maria Antoniak's code with minor modifications """
def scrape_subreddit(_target_subreddits, _target_types, _years):
    '''
    Takes a list of subreddits, a list of types of content to scrape, and a list of years to scrape from
    '''
    
    api = PushshiftAPI()

    print('Number of PushshiftApi shards that are not working:', api.metadata_.get('shards'))  # check if any Pushshift shards are down!
    
    for _subreddit in _target_subreddits:
        for _target_type in _target_types:
            for _year in _years:
                # scrape March and April of every requested year
                months = [3, 4]       # months to scrape
                end_dates = [31, 30]  # last day of each month

                for _month, _end_date in zip(months, end_dates):
                    _output_directory_path = os.path.join(base_path, _subreddit, _target_type)  # directory to store scraped data
                                                                                                # by subreddit and type of content
                    if not os.path.exists(_output_directory_path):  # if it does not exist
                        os.makedirs(_output_directory_path)         # create it!

                    _file_name = _subreddit + '-' + str(_year) + '-' + str(_month) + '.pkl'  # filename of the pickle file with scraped data

                    # scrape only if output file does not already exist
                    if _file_name not in os.listdir(_output_directory_path):

                        print(str(datetime.now()) + ' ' + ': Scraping r/' + _subreddit + ' ' + str(_year) + '-' + str(_month) + '...')

                        if _target_type == 'posts':
                            _posts_df = scrape_posts_from_subreddit(_subreddit, api, _year, _month, _end_date)
                            if not _posts_df.empty:
                                _posts_df.to_pickle(os.path.join(_output_directory_path, _file_name), protocol=4)

                        if _target_type == 'comments':
                            _comments_df = scrape_comments_from_subreddit(_subreddit, api, _year, _month, _end_date)
                            if not _comments_df.empty:
                                _comments_df.to_pickle(os.path.join(_output_directory_path, _file_name), protocol=4)

    print(str(datetime.now()) + ' ' + ': Done scraping!')

target_subreddits = ['birthcontrol']  # subreddits to scrape
target_types = ['posts']  # type of content to scrape
years = [2018,2019,2020,2021,2022]  # years to scrape
scrape_subreddit(target_subreddits, target_types, years)

Number of PushshiftApi shards that are not working: None
2022-05-16 14:45:44.138515 : Scraping r/birthcontrol 2018-3...

2022-05-16 14:45:48.948862 : Scraping r/birthcontrol 2018-4...

2022-05-16 14:45:57.547762 : Scraping r/birthcontrol 2019-3...

2022-05-16 14:46:21.991454 : Scraping r/birthcontrol 2019-4...

2022-05-16 14:46:32.819431 : Done scraping!

def combine_one_subreddit(_subreddit):  # combine all of a subreddit's scraped posts into one pickle file

    df_d = {'author': [], 'id': [], 'type': [], 'text': [],   # create a dictionary
            'url': [], 'link_id': [], 'parent_id': [],
            'subreddit': [], 'created_utc': []}
    
    subreddit_pkl_path = os.path.join('reddit_data', _subreddit, f'{_subreddit}.pkl') # file with all the data
    if not os.path.exists(subreddit_pkl_path):  # if file does not exist
        
        for target_type in ['posts']:
            files_directory_path = os.path.join('reddit_data', _subreddit, target_type)  # directory where scraped data is depending on subreddit and type of content
            all_target_type_files = glob.glob(os.path.join(files_directory_path, "*.pkl"))  # select all appropriate pickle files
            for f in all_target_type_files:  # we read each pickle file and include the info we want in the dictionary
                df = pd.read_pickle(f)


                if target_type == 'posts':
                    for index, row in df.iterrows():
                        df_d['author'].append(row['author'])
                        df_d['id'].append(f"{row['subreddit']}_{row['id']}_post")  # id of the post, e.g. 'birthcontrol_xyz123_post'
                        df_d['type'].append('post')
                        df_d['text'].append(row['selftext'])  # textual content of the post
                        df_d['url'].append(row['url'])  # url of the post
                        df_d['link_id'].append('N/A')
                        df_d['parent_id'].append('N/A')
                        df_d['subreddit'].append(row['subreddit'])
                        df_d['created_utc'].append(row['created_utc'])  # utc time stamp of the post



        subreddit_df = pd.DataFrame.from_dict(df_d)  # create pandas dataframe from dictionary
        subreddit_df.sort_values('created_utc', inplace=True, ignore_index=True)  # order dataframe by date of post
        subreddit_df['time'] = pd.to_datetime(subreddit_df['created_utc'], unit='s').apply(lambda x: x.to_datetime64())  # convert timestamp to date
        subreddit_df['date'] = subreddit_df['time'].apply(lambda x: str(x).split(' ')[0])
        subreddit_df['year'] = subreddit_df['time'].apply(lambda x: str(x).split('-')[0])
        subreddit_df.drop(columns=['time'])
        
        subreddit_df.to_pickle(subreddit_pkl_path, protocol=4)  # saving it to pickle format

for subreddit in target_subreddits:
    combine_one_subreddit(subreddit)

import re
import json
from collections import Counter
import numpy as np
import Levenshtein
import little_mallet_wrapper as lmw
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_pickle(os.path.join('reddit_data', 'birthcontrol', 'birthcontrol.pkl'))
df = df.dropna()
print(len(df))

16554

def print_info(df, _type):
    if _type != 'corpus':
        vectorizer = CountVectorizer(        # Token counts with stopwords
            input = 'content',               # input is a string of texts
            encoding = 'utf-8',
            strip_accents = 'unicode',
            lowercase = True
        )

        texts = df['text'].astype('string').tolist()
        X = vectorizer.fit_transform(texts)
        print(f"Total vectorized words in the corpus of {_type}:", X.sum())
        print(f"Average vectorized {_type} length:", int(X.sum()/X.shape[0]), "tokens")
    
    else:
        vectorizer = CountVectorizer(
            input = 'content',
            encoding = 'utf-8',
            strip_accents = 'unicode',
            lowercase = True,
            stop_words = 'english'          # remove stopwords
        )
        
        texts = df['text'].astype('string').tolist()
        X = vectorizer.fit_transform(texts)
        sum_words = X.sum(axis=0)
        words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
        
        words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
        
        word_dict = {}
        for a, b in words_freq:
            word_dict.setdefault(a, []).append(b)

        return word_dict
#https://www.geeksforgeeks.org/python-convert-list-tuples-dictionary/

def find_duplicates(_df):  # function to find duplicated posts in the data

    prev_doc = ''
    map_dict = {}  # dict of authors' posts
    duplicate_indexes = []  # list of duplicates' indexes for removal from dataframe
    for index, row in _df.iterrows():  # iterate over posts
        author = row['author']
        doc = row['text']

        # if author info is available we compare each post with previous ones by the same author
        # we compare/calculate the similarity between the posts using the Levenshtein distance
        if author != '[deleted]':
            if author in map_dict.keys():
                flag = 0
                idx = 0
                while idx < len(map_dict[author]) and flag == 0:
                    lev = Levenshtein.ratio(doc, map_dict[author][idx])
                    if lev > 0.99:
                        duplicate_indexes.append(index)
                        flag = 1
                    idx += 1
                if flag == 0:
                    map_dict[author].append(doc)
            else:
                map_dict[author] = [doc]

        # if author info is not available we compare each post with the preceding one chronologically
        else:
            lev = Levenshtein.ratio(row['text'], prev_doc)
            if lev > 0.90:
                duplicate_indexes.append(index)

        prev_doc = doc

    return duplicate_indexes
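
To make the similarity thresholds above more concrete, here is a minimal sketch of an edit-distance similarity written with the standard library only. It is a simplified illustration: the helper names `edit_distance` and `similarity` are hypothetical, and the normalization by the longer string's length is an assumption, not necessarily the formula that `Levenshtein.ratio` uses internally.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(a, b) / longest

# A repost that differs by one trailing character scores close to 1.
print(similarity("I have an IUD", "I have an IUD."))
```

Under this simplified measure, a high threshold such as 0.99 only flags near-identical reposts, while 0.90 also tolerates small edits.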

dupes = find_duplicates(df)  # find duplicates
df.drop(dupes, inplace=True)  # removing duplicates
print(f'Number of duplicates: {len(dupes)}')

Number of duplicates: 145

def cleaning_docs(raw_df, _subreddit):
    '''
    Takes the full corpus as a dataframe and the name of the subreddit.
    It cleans all the documents (removes punctuation and stopwords) and saves the clean corpus in a JSON file.
    '''
    clean_docs_file = os.path.join('reddit_data', _subreddit, f'clean_{_subreddit}.json')
    if not os.path.exists(clean_docs_file): 
        
        clean_d = {'id':[], 'clean':[], 'og':[], 'year':[], 'date':[]}

        for index, row in raw_df.iterrows():                               # iterating over posts and comments
            if 'bot' not in row['author'] and 'Bot' not in row['author']:  # if author is not a bot
                clean_doc_st = lmw.process_string(row['text'])             # cleaning documents
                clean_doc_l = clean_doc_st.split(' ')
                if len(set(clean_doc_l)) > 5 and 'bot' not in clean_doc_l:  # keep only posts with more than 5 distinct words
                                                                            # that do not contain the word 'bot'
                    clean_d['clean'].append(clean_doc_l)
                    clean_d['id'].append(row['id'])
                    clean_d['og'].append(row['text'])
                    clean_d['year'].append(row['year'])
                    clean_d['date'].append(row['date'])

        with open(clean_docs_file, 'w') as jsonfile:  # creating a file with the dict of documents to topic model
            json.dump(clean_d, jsonfile)

import json
import little_mallet_wrapper as lmw
import Levenshtein

%%time
for subreddit in target_subreddits:
        cleaning_docs(df, subreddit)

CPU times: user 6.15 s, sys: 177 ms, total: 6.33 s
Wall time: 6.53 s

df.head() #take a look at the data

posts = df[df.type == 'post']
posts1 = posts.reset_index()
posts2 = posts1.drop(columns = ['url','link_id','parent_id','created_utc','index','type','subreddit'])

posts2.head()

import spacy 
nlp = spacy.load("en_core_web_lg")

def get_pronouns(input_string, nlp):
    
    
    lemmatized = nlp(input_string)

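    # pronoun lists: a female poster is assumed to write in the first person
    # and to refer to a male partner in the third person
    f_speaker = ['I','me','my','mine','his','himself','he']
    fs_list = []
    # a male poster is assumed to refer to a female partner in the third person
    m_speaker = ['she', 'her', 'hers', 'herself']
    ms_list = []
    
    # relationship words: naming a male partner suggests a female poster, and vice versa
    f_words = ['boyfriend','husband','bf']
    fw = []
    m_words = ['girlfriend','wife','gf']
    mw = []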
    #add token to appropriate lists 
    for token in lemmatized: 
        if token.pos_ == 'PRON':
            if token.lemma_ in f_speaker:
                fs_list.append(token.lemma_)
            elif token.lemma_ in m_speaker:
                ms_list.append(token.lemma_)
        else:
            if token.lemma_ in m_words:
                mw.append(token.lemma_)
            elif token.lemma_ in f_words:
                fw.append(token.lemma_)

            
            
    return fs_list,ms_list,fw,mw

#create new columns for the word list counts we made
posts2['Pronouns_F'] = 0
posts2['Pronouns_M'] = 0 
posts2['Words_F'] = 0
posts2['Words_M'] = 0 
posts2['Poster_Gender'] = None

posts2.head() # display the dataframe with new columns

posts_small = posts2
#this was my tester but now I'm just using the whole thing

string_df = posts1.text[1]
for x in range(len(posts_small['text'])): 
    result = get_pronouns(posts_small.text[x], nlp)
    # use .loc so the counts are written into the dataframe itself, not a temporary copy
    posts_small.loc[x, 'Pronouns_F'] = len(result[0])
    posts_small.loc[x, 'Pronouns_M'] = len(result[1])
    posts_small.loc[x, 'Words_F'] = len(result[2])
    posts_small.loc[x, 'Words_M'] = len(result[3])

posts_small.head()

no_gender = []  # indexes of posts that could not be labeled
for x in range(len(posts_small)): 
    # relationship words decide first; pronoun counts break the remaining cases
    if posts_small.Words_F[x] == 0 and posts_small.Words_M[x] > 0:
        posts_small.loc[x, 'Poster_Gender'] = 1
    elif posts_small.Words_M[x] == 0 and posts_small.Words_F[x] > 0:
        posts_small.loc[x, 'Poster_Gender'] = 0
    elif posts_small.Pronouns_M[x] > posts_small.Pronouns_F[x]:
        posts_small.loc[x, 'Poster_Gender'] = 1
    elif posts_small.Pronouns_F[x] > posts_small.Pronouns_M[x]:
        posts_small.loc[x, 'Poster_Gender'] = 0
    else:
        no_gender.append(x)
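
The labeling rule above can also be read as a pure function of the four counts. The following is a minimal sketch of that reading; the function name `label_poster` and the example counts are hypothetical, and the label coding (1 for male, 0 for female, `None` for undecided) follows the notebook.

```python
def label_poster(pronouns_f, pronouns_m, words_f, words_m):
    """Return 1 (male), 0 (female), or None when the counts do not decide."""
    # Relationship words take priority when only one kind is present.
    if words_f == 0 and words_m > 0:
        return 1
    if words_m == 0 and words_f > 0:
        return 0
    # Otherwise fall back to whichever pronoun list matched more often.
    if pronouns_m > pronouns_f:
        return 1
    if pronouns_f > pronouns_m:
        return 0
    return None  # ties and posts without any signal stay unlabeled

print(label_poster(4, 19, 0, 2))  # 1: only 'girlfriend'-type words appear
print(label_poster(8, 0, 0, 0))   # 0: decided by pronoun counts
print(label_poster(3, 3, 1, 1))   # None: every signal is tied
```

Writing the rule this way makes it easy to test edge cases, such as ties, before applying it to the whole dataframe.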

print(posts_small.text[5])
print(posts_small.text[150])
print(posts_small.text[273])
# other unlabeled posts were deleted, were ads, or had equal numbers of male and female words

Has anyone ever taken plan b and MONTHS later felt insane? 
If you take the simulated period pills (placebo), will you miss that simulated period if you're pregnant, just like you would miss a "real" period? 
do the ingredients in mucinex affect nuvaring ?

#check to see if there is something missing here that could indicate gender
#for x in no_gender: 
    #print(posts_small.text[x])
#I explored this myself, but the output is long to look through

posts_small.head()

male = posts_small[posts_small.Poster_Gender == 1]
female = posts_small[posts_small.Poster_Gender == 0]
male.head(10)
print(len(male))

509

male.text[118]

'I have an IUD and I’m worried it might be displaced. I have been feeling pain and depression for a while now but I have the IUD for a uterus illness so that is likely why that exists, it just wasn’t as bad until after I got the IUD put in. \n\nI am aware that more muscle contractions means more likelihood of displacement. I have serious pelvic muscle dysfunction and I’m not sure if that could contribute. I tried to reach in and feel the strings and I can’t feel them. My girlfriend is trans, but she hasn’t been taking her hormones consistently for the past few months, she last took her meds two days ago. \n\nYesterday I tried penetration for the first time in two months now. She came pretty quickly, but it took only a few ins-and-outs only halfway (I couldn’t handle more) and she pulled put well before she came. I hovered over at a distance. None of it touched me unless there was potential precum when she penetrated me. \n\nAny chance I could be pregnant or my IUD could be dislodged? Sorry if I seem paranoid here, due to said health issues this is a huge deal for me'

male.text[274]

"My friend is quite worried, as her Skyle IUD has expired 2 months ago. However, last week she had unprotected sex, and the guy ejaculated a little bit inside of her before completely pulling out/finishing. \n\nThis was also day 4 or 5 of her period. There wasn't too much blood she recalls.\n\n1. Does skyla continue to work after its claimed expiration date, like Mirena does? I understand there may not be enough research on this yet. \n2. Does having an IUD physically present in the uterus, (expired or not), already have a large effect on blocking sperm?\n\nAny insight or ideas would be highly appreciated here, as she is quite stressed/worried at the moment. \n\nThank you in advance. \n\n\n"

count = 0
for x in male.index:

    if ' friend ' in male.text[x]:
        # used spaces around "friend" to avoid flagging girlfriend/boyfriend
        #print(x)
        #print(male.text[x])
        pass
# commented these out because the output was long; I went through the 14 "friend" results and dropped
# rows 215, 274, 2137, 2612, 3693, and 8713

male.head()

male1 = male.drop([215,274,2137,2612,3693,8713])

female1 = female.sample(n=503)

male1.to_csv('male.csv')
female1.to_csv('female.csv')

frames = [male1,female1]
final = pd.concat(frames)

final = final.reset_index(drop=True)  # renumber the rows and discard the old index

final.to_csv('final.csv')

from nltk.corpus import stopwords
stops = stopwords.words('english')

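def remove_pronouns(input_string, nlp):
    '''
    Takes the text of a post and a spaCy pipeline.
    Returns the mean word vector of the remaining content tokens and the list of kept tokens.
    '''
    lemmatized = nlp(input_string)
    words = []
    # relationship words that would leak the gold label
    gender_words = ['boyfriend','husband','bf','girlfriend','wife','gf']
    for token in lemmatized: 
        # compare the token's text (not the Token object) against the NLTK stopword strings
        if token.pos_ != 'PRON' and token.lemma_ not in gender_words and token.pos_ != 'PUNCT' and token.text.lower() not in stops:
            words.append(token)
    # keep only tokens that have a vector and are not stopwords, punctuation, or whitespace
    culled = [token for token in words if not (token.is_stop or token.is_punct or token.is_space) and token.has_vector]
    embed_vector = np.mean([token.vector for token in culled], axis=0)
    
    return embed_vector,words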

matrix = []
for x in final.text:
    value = remove_pronouns(x,nlp)[0] # call remove pronouns on each of the post texts
    matrix.append(value)

matrix = np.array(matrix) #turn into a np.array

from sklearn.preprocessing import StandardScaler
scaled = StandardScaler().fit_transform(matrix) #standard scale

# store the kept tokens of each post in a new column
final['tokens'] = [remove_pronouns(text, nlp)[1] for text in final.text]

final.head()

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
y = final.Poster_Gender.astype('int')
cross_score = np.mean(cross_val_score(LogisticRegression(max_iter = 10000), scaled,y , scoring='accuracy', cv=5))
print("The proper cross validated score: " + str(cross_score))

The proper cross validated score: 0.8359588197625731

from   sklearn.feature_selection import SelectKBest, mutual_info_classif

final_features2 = SelectKBest(score_func=mutual_info_classif, k=50)
best2 = final_features2.fit_transform(scaled,y)
cross_score2 = np.mean(cross_val_score(LogisticRegression(), best2, y, scoring='accuracy', cv=10))
print(cross_score2)

0.8131584158415842

from sklearn.model_selection import GridSearchCV
from   sklearn.model_selection import train_test_split
import warnings

%%time
# Grid search over multi-class strategies, penalties, and solvers

params = {
    'multi_class': ['auto', 'ovr', 'multinomial'],
    'penalty':['none', 'l2','l1','elasticnet'],
    'solver':['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'],
    'max_iter':[2000] # not part of the search, but set a classifier parameter
}
clf = GridSearchCV(LogisticRegression(max_iter = 1000), params, n_jobs=-1)

# work with a subset of the data, to speed things up
X_train, X_test, y_train, y_test = train_test_split(scaled, y, train_size=0.8)

# perform grid search
with warnings.catch_warnings() as w:
    warnings.simplefilter("ignore")
    clf.fit(X_train, y_train) # Note subset of the data!

CPU times: user 2.4 s, sys: 102 ms, total: 2.5 s
Wall time: 1min 29s

clf.best_params_

{'max_iter': 2000, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'saga'}

cross_score2 = np.mean(cross_val_score(LogisticRegression(max_iter = 2000, multi_class = 'auto', penalty = 'l2', solver = 'saga'), scaled, y, scoring='accuracy', cv=10))
print("The cross validated score: " + str(cross_score2))

The cross validated score: 0.841990099009901

female_words = print_info(female1,'corpus')

male_words = print_info(male1, 'corpus')

import matplotlib.pyplot as plt
import matplotlib as mpl
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image

for x in female_words:
    female_words[x] = female_words[x][0]

wordcloud = WordCloud(min_word_length =3,
                      background_color='white', colormap = 'Reds_r')

# generate the word cloud
wordcloud.generate_from_frequencies(female_words)

#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

for x in male_words:
    male_words[x] = male_words[x][0]

wordcloud = WordCloud(min_word_length =3,
                      background_color='white',colormap = 'Blues')

# generate the word cloud
wordcloud.generate_from_frequencies(male_words)

#plot
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

from scipy.stats import ttest_rel
common_words = set.intersection(set(male_words.keys()), set(female_words.keys()))

t,p = ttest_rel( [female_words[k] for k in common_words],[male_words[k] for k in common_words])

print("The t-statistic is: " + str(t)+ "\nThe p-value is: " + str(p))
#https://stackoverflow.com/questions/16892486/python-program-to-perform-t-test-on-frequency-list

The t-statistic is: 6.154906342348895
The p-value is: 8.894587120590231e-10
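
For intuition about what the paired test above computes, here is a minimal sketch of the paired t-statistic using only the standard library. The function name `paired_t_statistic` and the toy counts are hypothetical; the formula is the mean of the pairwise differences divided by its standard error.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t = mean(d) / (stdev(d) / sqrt(n)) for the pairwise differences d = a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n))

# Toy counts of four shared words in two corpora (made-up numbers).
corpus_a_counts = [5, 3, 8, 6]
corpus_b_counts = [2, 3, 4, 1]
t_toy = paired_t_statistic(corpus_a_counts, corpus_b_counts)
print(round(t_toy, 3))  # 2.777
```

Each shared word contributes one pair, so the statistic measures whether one corpus uses the shared vocabulary more often on average.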

# A freebie helper function to read and parse the emolex file
from   collections import defaultdict
emolex_file = 'emolex.txt'
def read_emolex(filepath=None):
    '''
    Takes a file path to the emolex lexicon file.
    Returns a dictionary of emolex sentiment values.
    '''
    if filepath is None: # Try to find the emolex file
        filepath = os.path.join('..','..','data','lexicons','emolex.txt')
        if os.path.isfile(filepath):
            pass
        elif os.path.isfile('emolex.txt'):
            filepath = 'emolex.txt'
        else:
            raise FileNotFoundError('No EmoLex file found')
    emolex = defaultdict(dict) # Like Counter(), defaultdict eases dictionary creation
    with open(filepath, 'r') as f:
    # emolex file format is: word emotion value
        for line in f:
            word, emotion, value = line.strip().split()
            emolex[word][emotion] = int(value)
    return emolex

# Get EmoLex data. Make sure you set the right file path above.
emolex = read_emolex(emolex_file)

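def sentiment_score(tokens, lexicon = emolex):
    '''
    Takes a list of tokens and a sentiment lexicon.
    Returns a dictionary of emotion scores normalized by the number of tokens.
    '''
    # making scoring dictionary, one entry per emotion in the lexicon
    sentiments = dict.fromkeys(lexicon['aback'].keys(), 0)

    sent_len = len(tokens)
    for word in tokens:
        # .get() avoids adding unknown words to the defaultdict
        for emotion, value in lexicon.get(word, {}).items():
            sentiments[emotion] += value

    # Normalize; an empty token list keeps every score at 0
    if sent_len > 0:
        for k in sentiments:
            sentiments[k] = round(sentiments[k] / sent_len, 3)

    return sentiments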
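
As a small worked example of the normalization above, the sketch below scores a four-token post against a made-up two-word lexicon. The names `toy_lexicon` and `emotion_rates` are hypothetical, and the lexicon entries are invented for illustration, not taken from EmoLex.

```python
# Invented lexicon with two emotions; real EmoLex entries have ten categories.
toy_lexicon = {
    'worried': {'fear': 1, 'joy': 0},
    'happy': {'fear': 0, 'joy': 1},
}

def emotion_rates(tokens, lexicon, emotions=('fear', 'joy')):
    """Sum lexicon values per emotion, then divide by the number of tokens."""
    totals = dict.fromkeys(emotions, 0)
    for word in tokens:
        for emotion, value in lexicon.get(word, {}).items():
            totals[emotion] += value
    n = len(tokens)
    # An empty post gets a rate of 0.0 for every emotion.
    return {e: round(totals[e] / n, 3) if n else 0.0 for e in emotions}

rates = emotion_rates(['so', 'worried', 'about', 'this'], toy_lexicon)
print(rates)  # {'fear': 0.25, 'joy': 0.0}
```

Because the sum is divided by the token count, long and short posts become comparable, which is why the per-gender means reported later are small fractions.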

# convert the spaCy tokens to plain strings so they can be looked up in the lexicon
final['tokens'] = final['tokens'].apply(lambda tokens: [str(t) for t in tokens])

female_emotions = pd.DataFrame(columns=['gender','anger', 'anticipation','disgust','fear', 'joy', 'negative','positive','sadness', 'surprise','trust'])
male_emotions = pd.DataFrame(columns=['gender','anger', 'anticipation','disgust','fear', 'joy', 'negative','positive','sadness', 'surprise','trust'])

for x in final.index:
    if final.Poster_Gender[x] == 1:
        sent = sentiment_score(final.tokens[x])
        male_emotions.loc[x] = final.Poster_Gender[x],sent['anger'],sent['anticipation'],sent['disgust'],sent['fear'],sent['joy'],sent['negative'],sent['positive'],sent['sadness'],sent['surprise'],sent['trust']
    else:
        sent = sentiment_score(final.tokens[x])
        female_emotions.loc[x] = final.Poster_Gender[x],sent['anger'],sent['anticipation'],sent['disgust'],sent['fear'],sent['joy'],sent['negative'],sent['positive'],sent['sadness'],sent['surprise'],sent['trust']
        
male_emotions.head()

import stat

m_emo = {}
f_emo = {}

for col in male_emotions.columns: 
    m_emo[col] = male_emotions[col].mean()

for col in female_emotions.columns: 
    f_emo[col] = female_emotions[col].mean()
print(m_emo)
print(f_emo)

{'gender': 1.0, 'anger': 0.010556660039761422, 'anticipation': 0.06117296222664019, 'disgust': 0.017646123260437352, 'fear': 0.035707753479125234, 'joy': 0.03728031809145126, 'negative': 0.05875149105367803, 'positive': 0.07716103379721675, 'sadness': 0.032047713717693826, 'surprise': 0.011101391650099393, 'trust': 0.06385288270377744}
{'gender': 0.0, 'anger': 0.015487077534791241, 'anticipation': 0.052982107355864845, 'disgust': 0.02133598409542741, 'fear': 0.03894234592445329, 'joy': 0.030206759443339937, 'negative': 0.05649105367793238, 'positive': 0.07070775347912524, 'sadness': 0.03418290258449304, 'surprise': 0.012723658051689844, 'trust': 0.05784294234592448}

	author	id	text	time	date	year
0	Analmarsh	birthcontrol_812tvt_post	Ok here’s my situation. It’s wednesday. I had ...	2018-03-01 05:07:12	2018-03-01	2018
1	the_crane_wife	birthcontrol_8132ww_post	Hi ladies, \nSo I've had my IUD for over a yea...	2018-03-01 05:53:47	2018-03-01	2018
2	danascullyvevo	birthcontrol_81350s_post	Hi y'all!\n\nSo question -- I have Kyleena IUD...	2018-03-01 06:05:16	2018-03-01	2018
3	alyssas_888	birthcontrol_813vd0_post	Did anyone else have low libido and anxiety af...	2018-03-01 08:43:57	2018-03-01	2018
4	holycornchips	birthcontrol_8154qq_post	I've learned a great deal from this sub and it...	2018-03-01 13:10:34	2018-03-01	2018

	author	id	text	time	date	year	Poster_Gender
0	Analmarsh	birthcontrol_812tvt_post	Ok here’s my situation. It’s wednesday. I had ...	2018-03-01 05:07:12	2018-03-01	2018	None
1	the_crane_wife	birthcontrol_8132ww_post	Hi ladies, \nSo I've had my IUD for over a yea...	2018-03-01 05:53:47	2018-03-01	2018	None
2	danascullyvevo	birthcontrol_81350s_post	Hi y'all!\n\nSo question -- I have Kyleena IUD...	2018-03-01 06:05:16	2018-03-01	2018	None
3	alyssas_888	birthcontrol_813vd0_post	Did anyone else have low libido and anxiety af...	2018-03-01 08:43:57	2018-03-01	2018	None
4	holycornchips	birthcontrol_8154qq_post	I've learned a great deal from this sub and it...	2018-03-01 13:10:34	2018-03-01	2018	None

	author	id	text	time	date	year	Pronouns_F	Pronouns_M
0	Analmarsh	birthcontrol_812tvt_post	Ok here’s my situation. It’s wednesday. I had ...	2018-03-01 05:07:12	2018-03-01	2018	8	0
1	the_crane_wife	birthcontrol_8132ww_post	Hi ladies, \nSo I've had my IUD for over a yea...	2018-03-01 05:53:47	2018-03-01	2018	11	4
2	danascullyvevo	birthcontrol_81350s_post	Hi y'all!\n\nSo question -- I have Kyleena IUD...	2018-03-01 06:05:16	2018-03-01	2018	12	0
3	alyssas_888	birthcontrol_813vd0_post	Did anyone else have low libido and anxiety af...	2018-03-01 08:43:57	2018-03-01	2018	6	0
4	holycornchips	birthcontrol_8154qq_post	I've learned a great deal from this sub and it...	2018-03-01 13:10:34	2018-03-01	2018	7	0

	author	id	text	time	date	year	Pronouns_F	Pronouns_M	Words_M	Poster_Gender
11	Anonthealex	birthcontrol_817j1w_post	I had sex with my now ex girlfriend on the 5th...	2018-03-01 18:35:47	2018-03-01	2018	13	6	1	1
34	ttty23	birthcontrol_81eesc_post	Hey guys so my gf has been on the pill for 5 m...	2018-03-02 14:05:17	2018-03-02	2018	4	19	2	1
215	alexxiskayla	birthcontrol_83arbl_post	My friend has been on birth control for a year...	2018-03-09 22:29:35	2018-03-09	2018	3	9	0	1
234	HoldenMegroin87	birthcontrol_83eu92_post	So im looming for some info and insight for my...	2018-03-10 12:18:31	2018-03-10	2018	4	5	1	1
258	Bestojojo	birthcontrol_83maky_post	My gf and I had unproptected sex in the first ...	2018-03-11 11:52:14	2018-03-11	2018	3	4	1	1

Bearing the Birth Control Burden

An exploration of men's and women's posts on r/birthcontrol

NetID: laf229

Introduction & Overview

Data & Methods

Part 1: Scraping Reddit Data from r/birthcontrol

Part 2: Assigning Gold Label Gender of Poster

Part 3: Remove Pronouns and Relationship Words from the Texts & Create a Matrix of Word Embeddings

Part 4: The Big Moment!! Logistic Regression!

Part 5: Bonus / Fun Little Explorations

Results

Part 1: Scraping Reddit Data from r/birthcontrol

Part 2: Assigning Gold Label Gender of Poster

Part 3: Remove Pronouns and Relationship Words from the Texts & Create a Matrix of Word Embeddings

Part 4: The Big Moment!! Logistic Regression!

Part 5: Bonus / Fun Little Explorations

A. Word Clouds

B. T-Test Word Lists

C. Sentiment Scoring

Discussion & Conclusion:

Limitations

Discussion

Conclusion

	author	id	type	text	url	link_id	parent_id	subreddit	created_utc	time	date	year
0	Analmarsh	birthcontrol_812tvt_post	post	Ok here’s my situation. It’s wednesday. I had ...	https://www.reddit.com/r/birthcontrol/comments...	N/A	N/A	birthcontrol	1519880832	2018-03-01 05:07:12	2018-03-01	2018
1	the_crane_wife	birthcontrol_8132ww_post	post	Hi ladies, \nSo I've had my IUD for over a yea...	https://www.reddit.com/r/birthcontrol/comments...	N/A	N/A	birthcontrol	1519883627	2018-03-01 05:53:47	2018-03-01	2018
2	danascullyvevo	birthcontrol_81350s_post	post	Hi y'all!\n\nSo question -- I have Kyleena IUD...	https://www.reddit.com/r/birthcontrol/comments...	N/A	N/A	birthcontrol	1519884316	2018-03-01 06:05:16	2018-03-01	2018
3	alyssas_888	birthcontrol_813vd0_post	post	Did anyone else have low libido and anxiety af...	https://www.reddit.com/r/birthcontrol/comments...	N/A	N/A	birthcontrol	1519893837	2018-03-01 08:43:57	2018-03-01	2018
4	holycornchips	birthcontrol_8154qq_post	post	I've learned a great deal from this sub and it...	https://www.reddit.com/r/birthcontrol/comments...	N/A	N/A	birthcontrol	1519909834	2018-03-01 13:10:34	2018-03-01	2018

	gender	anger	anticipation	disgust	fear	joy	negative	positive	sadness	surprise	trust
0	1.0	0.000	0.072	0.029	0.029	0.043	0.072	0.043	0.029	0.014	0.072
1	1.0	0.008	0.070	0.000	0.039	0.031	0.085	0.085	0.047	0.008	0.062
2	1.0	0.000	0.054	0.000	0.036	0.036	0.036	0.036	0.018	0.000	0.036
3	1.0	0.000	0.111	0.000	0.000	0.056	0.083	0.111	0.056	0.000	0.111
4	1.0	0.016	0.031	0.000	0.047	0.000	0.062	0.062	0.047	0.031	0.062