The tech industry's demanding work environment has highlighted the importance of addressing mental health issues among its workers. This project aims to analyze a comprehensive dataset collected by Open Source Mental Illness (OSMI) from 2014 to 2019 to gain insights into tech workers' attitudes towards mental health and the prevalence of mental illnesses in the industry. We will use Python and SQL to clean and process the data to ensure its accuracy and reliability.
We will start by loading the data into a Pandas dataframe and performing exploratory data analysis.
This will include creating statistical summaries and charts, testing for anomalies, checking for correlations and other relations between variables, and other EDA elements.
We will provide clear explanations in our notebook to inform the reader what we are trying to achieve, what results we got, and what these results mean.
We will also provide suggestions about how our analysis can be improved.
import numpy as np
import pandas as pd
import sqlite3 as sql
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora
import gensim
We can begin by loading the data and then examining the structure of each table, the type of data it contains, and identifying any missing values that may require attention.
# Load the data from three tables of sqlite file
conn = sql.connect('/Users/vytautas/Downloads/mental_health.sqlite')
query = "SELECT * FROM Survey;"
survey_df = pd.read_sql_query(query, conn)
query = "SELECT * FROM Question;"
question_df = pd.read_sql_query(query, conn)
query = "SELECT * FROM Answer;"
answer_df = pd.read_sql_query(query, conn)
# Now we can display first few rows of each dataframe
for df, name in zip([survey_df, question_df, answer_df], ['Survey', 'Question', 'Answer']):
print(f'\nFirst few rows of {name} DataFrame:\n', df.head())
First few rows of Survey DataFrame: SurveyID Description 0 2014 mental health survey for 2014 1 2016 mental health survey for 2016 2 2017 mental health survey for 2017 3 2018 mental health survey for 2018 4 2019 mental health survey for 2019 First few rows of Question DataFrame: questiontext questionid 0 What is your age? 1 1 What is your gender? 2 2 What country do you live in? 3 3 If you live in the United States, which state ... 4 4 Are you self-employed? 5 First few rows of Answer DataFrame: AnswerText SurveyID UserID QuestionID 0 37 2014 1 1 1 44 2014 2 1 2 32 2014 3 1 3 31 2014 4 1 4 31 2014 5 1
Initial thoughts, the relationships between 'SurveyID,' 'UserID,' and 'QuestionID' are maintained, allowing for a clear association between surveys, users, and their responses.
The dataset appears well-prepared and organized, providing a solid foundation for further analysis.
for df, name in zip([survey_df, question_df, answer_df], ['Survey', 'Question', 'Answer']):
print(f'\nInformation of {name} DataFrame:\n')
df.info()
print(f'\nDescriptive statistics of {name} DataFrame:\n', df.describe(include='all'))
Information of Survey DataFrame: <class 'pandas.core.frame.DataFrame'> RangeIndex: 5 entries, 0 to 4 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SurveyID 5 non-null int64 1 Description 5 non-null object dtypes: int64(1), object(1) memory usage: 208.0+ bytes Descriptive statistics of Survey DataFrame: SurveyID Description count 5.000000 5 unique NaN 5 top NaN mental health survey for 2016 freq NaN 1 mean 2016.800000 NaN std 1.923538 NaN min 2014.000000 NaN 25% 2016.000000 NaN 50% 2017.000000 NaN 75% 2018.000000 NaN max 2019.000000 NaN Information of Question DataFrame: <class 'pandas.core.frame.DataFrame'> RangeIndex: 105 entries, 0 to 104 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 questiontext 105 non-null object 1 questionid 105 non-null int64 dtypes: int64(1), object(1) memory usage: 1.8+ KB Descriptive statistics of Question DataFrame: questiontext questionid count 105 105.000000 unique 105 NaN top Have your previous employers provided mental h... NaN freq 1 NaN mean NaN 61.790476 std NaN 35.589458 min NaN 1.000000 25% NaN 27.000000 50% NaN 66.000000 75% NaN 92.000000 max NaN 118.000000 Information of Answer DataFrame: <class 'pandas.core.frame.DataFrame'> RangeIndex: 236898 entries, 0 to 236897 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 AnswerText 236898 non-null object 1 SurveyID 236898 non-null int64 2 UserID 236898 non-null int64 3 QuestionID 236898 non-null int64 dtypes: int64(3), object(1) memory usage: 7.2+ MB Descriptive statistics of Answer DataFrame: AnswerText SurveyID UserID QuestionID count 236898 236898.000000 236898.000000 236898.000000 unique 4215 NaN NaN NaN top -1 NaN NaN NaN freq 56598 NaN NaN NaN mean NaN 2016.572297 2514.516619 48.655898 std NaN 1.421134 1099.462747 36.126215 min NaN 2014.000000 1.000000 1.000000 25% NaN 2016.000000 1691.000000 15.000000 50% NaN 2016.000000 2652.000000 48.000000 75% NaN 2017.000000 3439.000000 80.000000 max NaN 2019.000000 4218.000000 118.000000
The Survey DataFrame contains information about the surveys conducted by Open Source Mental Illness (OSMI) from 2014 to 2019. It has the following characteristics:
Structure: It consists of 2 columns: SurveyID
and Description
.
Data Types: The SurveyID
column is of type int64
, indicating that it holds numerical values. The Description
column is of type object
, suggesting that it holds text-based information.
Size: It has 5 entries, indicating that there are a total of 5 surveys in the dataset.
Descriptive Statistics: The mean value of the SurveyID
column is 2016.8, indicating that the surveys were mostly conducted around 2016. The standard deviation is 1.92, suggesting that the survey IDs are relatively spread out. The minimum and maximum values confirm that the dataset includes surveys from 2014 to 2019.
The Question DataFrame contains information about the questions asked in the surveys. It has the following characteristics:
Structure: It has 2 columns: questiontext
and questionid
.
Data Types: The questiontext
column is of type object
, suggesting that it holds text-based information. The questionid
column is of type int64
, indicating that it holds numerical values.
Size: It has 105 entries, indicating that there are a total of 105 questions in the dataset.
Descriptive Statistics: The mean value of questionid
is 61.79, indicating that the questions were relatively evenly distributed. The standard deviation of 35.59 suggests a significant variation in question IDs. The minimum and maximum values confirm that the dataset includes 105 questions.
The Answer DataFrame contains answers to the questions, including the answer text, survey ID, user ID, and question ID. It has the following characteristics:
Structure: It has 4 columns: AnswerText
, SurveyID
, UserID
, and QuestionID
.
Data Types: The AnswerText
column is of type object
, suggesting that it holds text-based information. The SurveyID
, UserID
, and QuestionID
columns are of type int64
, indicating that they hold numerical values.
Size: It has 236898 entries, indicating that there are a total of 236898 answers in the dataset.
Descriptive Statistics: The average AnswerText
value is NaN, suggesting that the answers are mostly text-based. The average SurveyID
value is 2016.57, indicating that the answers are mostly associated with surveys conducted around 2016. The standard deviation of 1.42 suggests that the answers are relatively spread out. The minimum and maximum values confirm that the answers are associated with all five surveys in the dataset.
# Furthermore we double check for duplicate values and missing values
for df, name in zip([survey_df, question_df, answer_df], ['Survey', 'Question', 'Answer']):
print(f'\nMissing values in {name} DataFrame:\n', df.isnull().sum())
print(f'\nDuplicate rows in {name} DataFrame:', df.duplicated().sum())
Missing values in Survey DataFrame: SurveyID 0 Description 0 dtype: int64 Duplicate rows in Survey DataFrame: 0 Missing values in Question DataFrame: questiontext 0 questionid 0 dtype: int64 Duplicate rows in Question DataFrame: 0 Missing values in Answer DataFrame: AnswerText 0 SurveyID 0 UserID 0 QuestionID 0 dtype: int64 Duplicate rows in Answer DataFrame: 0
The absence of missing values and duplicate rows across all dataframes suggests that the data is complete, unique, and suitable for analysis.
Additionally, the -1 values in the dataset may be an alternative representation of NULL values. We will investigate this further during the exploratory data analysis (EDA) process.
This clean state may minimize the need for extensive data cleaning, making further exploration straightforward, depending on the state and meaning of -1.
Therefore, we can proceed to address the EDA questions, conduct hypothesis testing, and accomplish some of the objectives.
# Quick look on the available questions
questions = question_df['questiontext']
for i, question in enumerate(questions, 1):
print(f"{i}. {question}")
1. What is your age? 2. What is your gender? 3. What country do you live in? 4. If you live in the United States, which state or territory do you live in? 5. Are you self-employed? 6. Do you have a family history of mental illness? 7. Have you ever sought treatment for a mental health disorder from a mental health professional? 8. How many employees does your company or organization have? 9. Is your employer primarily a tech company/organization? 10. Does your employer provide mental health benefits as part of healthcare coverage? 11. Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer? 12. Would you bring up a mental health issue with a potential employer in an interview? 13. Is your primary role within your company related to tech/IT? 14. Do you know the options for mental health care available under your employer-provided health coverage? 15. Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)? 16. Does your employer offer resources to learn more about mental health disorders and options for seeking help? 17. If a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to ask for that leave? 18. Would you feel comfortable discussing a mental health issue with your coworkers? 19. Would you feel comfortable discussing a mental health issue with your direct supervisor(s)? 20. Do you have medical coverage (private insurance or state-provided) that includes treatment of mental health disorders? 21. Do you know local or online resources to seek help for a mental health issue? 22. Do you have previous employers? 23. Have your previous employers provided mental health benefits? 24. Were you aware of the options for mental health care provided by your previous employers? 25. Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)? 26. Did your previous employers provide resources to learn more about mental health disorders and how to seek help? 27. Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers? 28. Would you have been willing to discuss your mental health with your direct supervisor(s)? 29. Would you be willing to bring up a physical health issue with a potential employer in an interview? 30. How willing would you be to share with friends and family that you have a mental illness? 31. Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace? 32. Have you had a mental health disorder in the past? 33. Do you currently have a mental health disorder? 34. Have you ever been diagnosed with a mental health disorder? 35. If you have a mental health disorder, how often do you feel that it interferes with your work when being treated effectively? 36. If you have a mental health disorder, how often do you feel that it interferes with your work when not being treated effectively (i.e., when you are experiencing symptoms)? 37. What country do you work in? 38. What US state or territory do you work in? 39. If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts? 40. If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees? 41. Do you believe your productivity is ever affected by a mental health issue? 42. If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue? 43. Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace? 44. Would you feel more comfortable talking to your coworkers about your physical health or your mental health? 45. Have you ever discussed your mental health with your employer? 46. Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions. 47. Have you ever discussed your mental health with coworkers? 48. Describe the conversation with coworkers you had about your mental health including their reactions. 49. Have you ever had a coworker discuss their or another coworker's mental health with you? 50. Describe the conversation your coworker had with you about their mental health (please do not use names). 51. Overall, how much importance does your employer place on physical health? 52. Overall, how much importance does your employer place on mental health? 53. If you have revealed a mental health disorder to a client or business contact, how has this affected you or the relationship? 54. If you have revealed a mental health disorder to a coworker or employee, how has this impacted you or the relationship? 55. Was your employer primarily a tech company/organization? 56. Would you have felt more comfortable talking to your previous employer about your physical health or your mental health? 57. Did you ever discuss your mental health with your previous employer? 58. Describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions. 59. Did you ever discuss your mental health with a previous coworker(s)? 60. Describe the conversation you had with your previous coworkers about your mental health including their reactions. 61. Did you ever have a previous coworker discuss their or another coworker's mental health with you? 62. Describe the conversation your coworker had with you about their mental health (please do not use names)..1 63. Overall, how much importance did your previous employer place on physical health? 64. Overall, how much importance did your previous employer place on mental health? 65. Are you openly identified at work as a person with a mental health issue? 66. Has being identified as a person with a mental health issue affected your career? 67. How has it affected your career? 68. If they knew you suffered from a mental health disorder, how do you think that your team members/co-workers would react? 69. Describe the circumstances of the badly handled or unsupportive response. 70. Have you observed or experienced supportive or well handled response to a mental health issue in your current or previous workplace? 71. Describe the circumstances of the supportive or well handled response. 72. Overall, how well do you think the tech industry supports employees with mental health issues? 73. Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees. 74. If there is anything else you would like to tell us that has not been covered by the survey questions, please use this space to do so. 75. Would you be willing to talk to one of us more extensively about your experiences with mental health issues in the tech industry? (Note that all interview responses would be used _anonymously_ and only with your permission.) 76. What is your race? 77. Do you think that discussing a physical health issue with your employer would have negative consequences? 78. Do you feel that your employer takes mental health as seriously as physical health? 79. If you have a mental health condition, do you feel that it interferes with your work? 80. Do you work remotely (outside of an office) at least 50% of the time? 81. Do you know the options for mental health care your employer provides? 82. Has your employer ever discussed mental health as part of an employee wellness program? 83. Does your employer provide resources to learn more about mental health issues and how to seek help? 84. How easy is it for you to take medical leave for a mental health condition? 85. Do you think that discussing a mental health issue with your employer would have negative consequences? 86. Would you be willing to discuss a mental health issue with your coworkers? 87. Would you be willing to discuss a mental health issue with your direct supervisor(s)? 88. Would you bring up a physical health issue with a potential employer in an interview? 89. Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace? 90. Any additional notes or comments 91. Do you think that discussing a mental health disorder with your employer would have negative consequences? 92. Have you heard of or observed negative consequences for co-workers who have been open about mental health issues in your workplace? 93. If you have revealed a mental health issue to a client or business contact, do you believe this has impacted you negatively? 94. If you have revealed a mental health issue to a coworker or employee, do you believe this has impacted you negatively? 95. Do you think that discussing a mental health disorder with previous employers would have negative consequences? 96. Do you think that discussing a physical health issue with previous employers would have negative consequences? 97. Would you have been willing to discuss a mental health issue with your previous co-workers? 98. Did you feel that your previous employers took mental health as seriously as physical health? 99. Did you hear of or observe negative consequences for co-workers with mental health issues in your previous workplaces? 100. Do you feel that being identified as a person with a mental health issue would hurt your career? 101. Do you think that team members/co-workers would view you more negatively if they knew you suffered from a mental health issue? 102. If yes, what condition(s) have you been diagnosed with? 103. If maybe, what condition(s) do you believe you have? 104. Which of the following best describes your work position? 105. Do you work remotely?
Specific Questions we will look into:
# Query for variations of the term "mental health"
mental_health_variations_query = """
SELECT SurveyID, COUNT(DISTINCT UserID) as Count
FROM Answer
WHERE (AnswerText LIKE '%mental health%' OR AnswerText LIKE '%mental well-being%' OR AnswerText LIKE '%psychological health%')
AND AnswerText != '-1'
GROUP BY SurveyID;
"""
mental_health_variations_df = pd.read_sql_query(mental_health_variations_query, conn)
# Query for related topics
related_topics_query = """
SELECT SurveyID, COUNT(DISTINCT UserID) as Count
FROM Answer
WHERE (AnswerText LIKE '%stress%' OR AnswerText LIKE '%wellness%' OR AnswerText LIKE '%burnout%')
AND AnswerText != '1'
GROUP BY SurveyID;
"""
related_topics_df = pd.read_sql_query(related_topics_query, conn)
# Merging related queries
df = pd.merge(mental_health_variations_df, related_topics_df, on='SurveyID', suffixes=('_mental_health', '_related_topics'))
# Total participants
total_participants_query = """
SELECT SurveyID, COUNT(DISTINCT UserID) as TotalParticipants
FROM Answer
WHERE AnswerText != '1'
GROUP BY SurveyID;
"""
total_participants_df = pd.read_sql_query(total_participants_query, conn)
# After merging, calculating the normalized count for both mental health and related topics
df = pd.merge(df, total_participants_df, on='SurveyID')
df['NormalizedCount_MentalHealth'] = df['Count_mental_health'] / df['TotalParticipants'] * 100
df['NormalizedCount_RelatedTopics'] = df['Count_related_topics'] / df['TotalParticipants'] * 100
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['SurveyID'], y=df['NormalizedCount_MentalHealth'], name='Mental Health'))
fig.add_trace(go.Scatter(x=df['SurveyID'], y=df['NormalizedCount_RelatedTopics'], name='Related Topics'))
fig.update_layout(
title='Normalized Worker Perspectives Over Time',
xaxis_title='Survey Year',
yaxis_title='Normalized Count (%)',
title_x = 0.5)
fig.show()
The upward trend of "mental health" keyword mentions, including its variations, exhibited a remarkable rise in 2017, surpassing the data from 2014 by 28% and peaking at 38% in 2018. However, this upward trajectory seems to be reversing, indicating a decline in recent years, as the percentage dropped to 34.6% in 2019. These fluctuations suggest that the topic of mental health has gained significant relevance in recent years.
To gain a deeper understanding of these fluctuations, further exploration is essential to identify potential influencing factors. Analysis of factors such as age, gender, company size, and the availability of mental health resources could uncover the reasons behind this shift.
query = """
SELECT SurveyID, AnswerText as Age
FROM Answer
WHERE QuestionID = 1 and AnswerText != '-1'
ORDER BY SurveyID;
"""
df_age = pd.read_sql_query(query, conn)
df_age['Age'] = df_age['Age'].astype(float)
df_age_avg = df_age.groupby('SurveyID').mean()
fig = px.line(df_age_avg, y='Age', title='Average Age Over Time')
fig.update_layout(
xaxis_title='Survey Year',
yaxis_title='Average Age',
title_x=0.5)
fig.show()
Even in the face of the surge in "mental health" keyword mentions in 2018, the average age of individuals affected remained remarkably constant at 35 across 2017, 2018, and even increased slightly to 35.5 in 2019.
This suggests that individuals experiencing mental health issues may be becoming more resilient and mature as they age, possibly due to improved awareness and treatment options.
Moreover, the declining trend in keyword usage during these years could indicate that some individuals are effectively managing their mental health challenges.
In light of these observations, we now turn our attention to gender to determine whether males, females, or other groups were more likely to use the mental health keyword in their responses.
query = """
SELECT SurveyID, AnswerText as Gender
FROM Answer
WHERE QuestionID = 2
ORDER BY SurveyID;
"""
# Grouping based on biologically accepted genders and else
answer_df = pd.read_sql_query(query, conn)
answer_df['Gender'] = answer_df['Gender'].str.lower()
answer_df['Gender'] = answer_df['Gender'].apply(lambda x: x if x in ['male', 'female'] else 'other')
answer_df_grouped = answer_df.groupby(['SurveyID', 'Gender']).size().reset_index(name='Count')
# Getting percentages for the visualizations
total_count = answer_df_grouped.groupby('SurveyID')['Count'].transform('sum')
answer_df_grouped['Percentage'] = (answer_df_grouped['Count'] / total_count) * 100
fig = go.Figure()
for gender in answer_df_grouped['Gender'].unique():
df_subset = answer_df_grouped[answer_df_grouped['Gender'] == gender]
fig.add_trace(go.Bar(
x=df_subset['SurveyID'],
y=df_subset['Percentage'],
name=gender,
text=df_subset['Percentage'].round(1).astype(str) + '%',
textposition='auto',
))
fig.update_layout(
barmode='group',
xaxis_title='Year',
yaxis_title='Percentage',
title='Gender Distribution Over Time',
title_x=0.5,
legend_title_text='Gender')
fig.show()
While the gender distribution of respondents is predominantly male, with over 60%, followed by women and others, it is noteworthy that the percentage of women participating is increasing alongside the trend of "mental health" keyword usage.
In 2014, only 19.6% of respondents were women, and by 2018, this percentage had increased to 30%, suggesting a potential association between female gender and increased mental health concerns.
However, this trend appears to reverse when female participation declines. In 2019, female participation decreased by 2.2%, from 30% in 2018 to 27.8% in 2019.
Furthermore, we can proceed to examine whether the impact is related to the company size and its resources towards addressing mental health challenges.
query = """
SELECT SurveyID, AnswerText as Size, COUNT(*) as Occurrences
FROM Answer
WHERE QuestionID = 8 and AnswerText != '-1'
GROUP BY SurveyID, AnswerText
ORDER BY SurveyID, AnswerText
"""
df_size = pd.read_sql_query(query, conn)
# Calculate total occurrences for each survey year
total_occurrences = df_size.groupby('SurveyID')['Occurrences'].transform('sum')
# Calculate percentages for each survey year
df_size['Percentage'] = (df_size['Occurrences'] / total_occurrences) * 100
fig = px.bar(df_size, x='SurveyID', y='Percentage', color='Size',
labels={'Percentage': '% of Respondents', 'SurveyID': 'Survey Year'},
title='Company Size Distribution Over Time',
barmode='group',
text=df_size['Percentage'].round(1).astype(str) + '%')
fig.update_layout(
title_x=0.5,
legend_title_text='Company Size'
)
fig.show()
We can observe that when mental health terms and its variations were used the most, smaller companies were more involved.
Suggesting, that perhaps the workload is way higher there and stress levels are rising.
Even though the number of survey participants is decreasing, we notice that fewer people discuss mental health in larger, well-established companies with employee counts of 1000 or more.
Therefore, we need to examine the trend of mental health resources available for employees. However, before we do that, it would be beneficial to investigate the relationship between CompanySize and AttitudesCount.
This will help us understand how attitudes towards mental health vary across different company sizes, which could either confirm or deny our observations.
# Company size
company_size_query = """
SELECT S.SurveyID, A1.AnswerText as CompanySize
FROM Survey as S
JOIN Answer as A1
ON S.SurveyID = A1.SurveyID AND A1.QuestionID = 8
WHERE A1.AnswerText != '-1'
ORDER BY S.SurveyID;
"""
df_company_size = pd.read_sql_query(company_size_query, conn)
# Attitudes Towards Mental Health
attitudes_query = """
SELECT S.SurveyID, COUNT(DISTINCT A2.UserID) as AttitudesCount
FROM Survey as S
JOIN Answer as A2
ON S.SurveyID = A2.SurveyID
WHERE (A2.AnswerText LIKE '%mental health%' OR A2.AnswerText LIKE '%mental well-being%' OR A2.AnswerText LIKE '%psychological health%')
AND A2.AnswerText != '-1'
GROUP BY S.SurveyID;
"""
df_attitudes = pd.read_sql_query(attitudes_query, conn)
df_merged = pd.merge(df_company_size, df_attitudes, on='SurveyID')
group_stats = df_merged.groupby('CompanySize')['AttitudesCount'].mean().sort_values()
fig = px.box(df_merged, x='CompanySize', y='AttitudesCount', category_orders={"CompanySize": group_stats.index.tolist()})
fig.update_layout(
title='Box Plot of AttitudesCount Across Company Sizes',
xaxis_title='Company Size',
yaxis_title='Attitudes Count',
title_x=0.5
)
fig.show()
Our observations indicate that smaller companies tend to discuss mental health more openly than larger companies, particularly during periods when mental health terms are more prevalent in the surveys.
This suggests that smaller organizations may experience higher workloads and stress levels, leading to a greater need for open dialogue about mental health concerns.
In contrast, despite having fewer survey participants, larger companies with over 1000 employees exhibited a lower frequency of discussions about mental health.
This finding raises questions about the availability of mental health resources and support systems in these larger organizations.
query = """
SELECT
SurveyID,
CASE
WHEN AnswerText IN ("Don't know", "I don't know", "No", "Not eligible for coverage / NA") THEN 'No'
ELSE AnswerText
END AS GroupedAnswer,
COUNT(*) as ResponseCount
FROM Answer
WHERE QuestionID = 10 AND AnswerText != '-1'
GROUP BY SurveyID, GroupedAnswer
ORDER BY SurveyID, GroupedAnswer;
"""
df_resources = pd.read_sql_query(query, conn)
# Percentage calculation
total_counts = df_resources.groupby('SurveyID')['ResponseCount'].transform('sum')
df_resources['Percentage'] = (df_resources['ResponseCount'] / total_counts) * 100
# Grouped bar chart
fig = px.bar(df_resources, x='SurveyID', y='ResponseCount', color='GroupedAnswer',
labels={'ResponseCount': 'Response Count', 'SurveyID': 'Survey Year'},
title='Availability of mental resources',
barmode='group',
text=df_resources['Percentage'].round(1).astype(str) + '%'
)
fig.update_layout(
legend_title_text='Answer',
title_x=0.5
)
fig.show()
Furthermore, we check what percentage do not know about the availability of mental resources.
query = """
SELECT
SurveyID,
CASE
WHEN AnswerText IN ("No", "Not eligible for coverage / NA") THEN 'No'
WHEN AnswerText IN ("Don't know", "I don't know") THEN 'Maybe'
ELSE AnswerText
END AS GroupedAnswer,
COUNT(*) as ResponseCount
FROM Answer
WHERE QuestionID = 10 AND AnswerText != '-1'
GROUP BY SurveyID, GroupedAnswer
ORDER BY SurveyID, GroupedAnswer;
"""
df_resources1 = pd.read_sql_query(query, conn)
# Percentage calculation
total_counts1 = df_resources1.groupby('SurveyID')['ResponseCount'].transform('sum')
df_resources1['Percentage'] = (df_resources1['ResponseCount'] / total_counts1) * 100
fig = px.bar(df_resources1, x='SurveyID', y='ResponseCount', color='GroupedAnswer',
labels={'ResponseCount': 'Response Count', 'SurveyID': 'Survey Year'},
title='Availability of mental resources with no assumptions',
barmode='group',
text=df_resources1['Percentage'].round(1).astype(str) + '%'
)
fig.update_layout(
legend_title_text='Answer',
title_x=0.5
)
fig.show()
We can observe that the percentage of people who do not know about the existence of mental health resources in their company fluctuated between 32.4% in 2014 and 23.5% in 2018.
Furthermore, the analysis shows that a greater number of larger companies (with over 1000 employees) corresponded with a higher availability of mental health resources.
This suggests that larger organizations may have more resources available to support their employees' mental health.
The discussion of "mental health" and related terms experienced a significant surge in 2017 and 2018, with corresponding increases of 34% and 38%, respectively, indicating a growing relevance of the topic.
However, the trend reversed in 2019 to 34.65%, suggesting a declining interest.
Despite the peak in keyword usage in 2018, the average age of affected individuals remained stable of 35 years old, indicating potential resilience and improved awareness.
The gender distribution shifted, with an increasing percentage of women participating alongside the rise in mental health discussions.
However, female participation declined in 2019, suggesting a potential correlation between the keyword usage and female participation, can be analyzed further.
Smaller companies engaged more in mental health discussions during peak periods, possibly due to higher workloads.
Larger companies, despite fewer participants, discussed mental health less, raising questions about resource availability.
Larger companies with over 1000 employees demonstrated a higher availability of mental health resources, suggesting better support systems.
Investigate the prevalence and changes in mental health disorders among tech professionals over time.
Specific Questions we will look into:
query = """
SELECT
SurveyID,
CASE
WHEN LOWER(AnswerText) = "don't know" OR LOWER(AnswerText) = 'possibly' THEN 'Maybe'
ELSE AnswerText
END as MentalHealthDisorder,
COUNT(*) as DisorderCount
FROM
Answer
WHERE
QuestionID IN (32, 33, 34)
AND AnswerText != '-1'
GROUP BY
SurveyID,
MentalHealthDisorder
ORDER BY
SurveyID,
MentalHealthDisorder;
"""
df_disorders = pd.read_sql_query(query, conn)
# Percentage calculation
total_counts = df_disorders.groupby('SurveyID')['DisorderCount'].transform('sum')
df_disorders['Percentage'] = (df_disorders['DisorderCount'] / total_counts) * 100
fig = px.bar(df_disorders,
x='SurveyID',
y='DisorderCount',
color='MentalHealthDisorder',
labels={'DisorderCount': 'Reported Disorders Count', 'SurveyID': 'Survey Year'},
title='Variations in Reported Mental Health Disorders Across Survey Years',
barmode='group',
text=df_disorders['Percentage'].round(1).astype(str) + '%')
fig.update_layout(
legend_title_text='Answer',
title_x = 0.5
)
fig.show()
Mental health disorders reached a peak in 2018, with 57% of respondents indicating 'yes.'
However, there was a reversal in 2019, with only 51.5% responding 'yes.'
This trend aligns with the increase in the percentage of female participants and the corresponding percentage of reported mental health disorders.
This suggests that females may be more open about mental health issues, or there may be a general trend in this direction.
Therefore, it would be worthwhile to investigate the correlation between response count and disorder count to determine if there is any significant relationship.
Finally, it would be valuable to explore in future research whether female participants are genuinely more open about mental health or if they experience higher levels of workplace stress due to gender-related issues and stereotypes.
# Merge the dataframes
merged_df = pd.merge(df_disorders, df_resources, on='SurveyID', suffixes=('_disorders', '_resources'))
merged_df['Percentage_disorders'] = (merged_df['DisorderCount'] / merged_df.groupby('SurveyID')['DisorderCount'].transform('sum')) * 100
merged_df['Percentage_resources'] = (merged_df['ResponseCount'] / merged_df.groupby('SurveyID')['ResponseCount'].transform('sum')) * 100
correlation_variables = merged_df[['Percentage_disorders', 'Percentage_resources']]
correlation_matrix = correlation_variables.corr()
print(correlation_matrix)
plt.figure(figsize=(6, 4))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
plt.title('Correlation Matrix: Mental Health Disorders vs. Availability of Resources')
plt.show()
Percentage_disorders Percentage_resources Percentage_disorders 1.000000e+00 -2.025954e-17 Percentage_resources -2.025954e-17 1.000000e+00
The correlation between the percentages of reported mental health disorders and the availability of mental health resources is very close to zero.
This means that, based on the data we have, there is no clear pattern or trend suggesting that when the percentage of reported mental health disorders increases or decreases, the percentage of available mental health resources tends to do the same.
In simpler terms, these two factors don't seem to be strongly connected in a straightforward, predictable way.
It iss important to keep in mind that correlation doe not tell us about causation, and other factors not considered in our analysis might play a role.
The lack of a strong correlation suggests that, at least in a linear sense, these particular aspects of mental health disorders and resource availability do not show a clear relationship in our dataset.
The analysis shows that there were significant changes in the number of people who reported having mental health disorders over the years studied.
The number peaked in 2018 (57%) and then decreased in 2019 (51.5%).
This trend coincides with an increase in the number of female participants, and a corresponding increase in the percentage of them reporting mental health disorders.
These findings raise questions about whether this shift is due to increased openness among females or reflects broader societal trends, possibly linked to workplace stress related to gender issues.
Correlation analysis shows that there is a very weak correlation between reported mental health disorders and the availability of mental health resources.
This suggests that there is no clear linear relationship between these two factors. It is important to remember that correlation does not equal causation.
The lack of a strong correlation highlights the need for a more in-depth understanding of the factors that influence mental health trends in this dataset.
Further research is needed to identify additional factors that contribute to the observed patterns.
stop = set(stopwords.words('english'))
lemma = WordNetLemmatizer()
texts = question_df['questiontext'].apply(lambda x: [lemma.lemmatize(word) for word in x.lower().split() if word not in stop])
dictionary = corpora.Dictionary(texts)
doc_term_matrix = [dictionary.doc2bow(text) for text in texts]
lda = gensim.models.ldamodel.LdaModel(doc_term_matrix, num_topics=5, id2word = dictionary, passes=50)
topics = lda.print_topics()
for topic in topics:
print(topic[0])
word_weight_pairs = topic[1].split(' + ')
for pair in word_weight_pairs[:5]:
print(pair)
print('\n')
0 0.104*"health" 0.103*"mental" 0.037*"ever" 0.029*"employer" 0.025*"previous" 1 0.059*"mental" 0.054*"health" 0.052*"would" 0.040*"issue" 0.034*"willing" 2 0.073*"health" 0.066*"would" 0.054*"mental" 0.053*"employer" 0.041*"think" 3 0.078*"mental" 0.067*"health" 0.046*"describe" 0.034*"previous" 0.031*"conversation" 4 0.040*"health?" 0.040*"employer" 0.040*"overall," 0.040*"much" 0.040*"place"
The output we have is a list of topics that the LDA model has identified in survey questions, with each topic represented by its top 5 words. By analyzing these words, we can see that the survey should cover the following topics:
Topic 1 seems to be about mental health resources provided by the employer.
Topic 2 might be about previous experiences with mental health in the workplace.
Topic 3 could be about conversations around mental health at work.
Topic 4 appears to be about potential mental health issues in the workplace.
Topic 5 might be about wellness programs or initiatives related to mental health.
Based on this model and the questions we have analyzed, we can confidently say that we have identified the most crucial questions that directly impact employee mental health.
Diversity and Inclusion: Explore opportunities to enhance diversity and inclusion by incorporating additional demographic questions that capture a broader range of identities, such as ethnicity, sexual orientation, and disability status.
Geographical Specifics: Extend geographical data collection by incorporating more detailed location-related questions. This could include regional information, allowing for a more granular analysis of mental health trends based on location.
Remote Work Impact: Further investigate the impact of remote work on mental health by refining questions related to remote work conditions, challenges, and support. This could provide insights into the unique challenges faced by remote workers.