%load_ext pycodestyle_magic
#%reload_ext pycodestyle_magic
%pycodestyle_on
%flake8_on
%flake8_on --max_line_length 79
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

coursera = pd.read_csv("./data/coursera_data.csv")
coursera.head()

# Check dataframe shape
print(coursera.shape)

(891, 7)

# Rename the "Unnamed: 0" column to "course_id"
coursera.rename(columns={'Unnamed: 0': 'course_id'}, inplace=True)

# Set the "course_id" column as the index
coursera.set_index('course_id', inplace=True)
coursera.head()

# Define the conversion dictionary
conversion_dict = {'k': '*1e3', 'm': '*1e6'}

# Replace characters and evaluate expressions
coursera['course_students_enrolled'] = (
    coursera['course_students_enrolled']
    .replace(conversion_dict, regex=True)
    .map(pd.eval)
    .astype(int)
)
coursera.head()
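
The conversion above relies on `pd.eval` to evaluate strings such as `5.3*1e3`. As a hedged alternative, the sketch below parses the suffix explicitly. `parse_enrollment` and `SUFFIX_MULTIPLIERS` are hypothetical names introduced here, and the sketch assumes the raw values look like `5.3k`, `130k` or `3.2m`, as in the table preview.

```python
# Hypothetical helper, not part of the original notebook.
SUFFIX_MULTIPLIERS = {'k': 1_000, 'm': 1_000_000}


def parse_enrollment(value):
    """Convert strings such as '5.3k' or '3.2m' to an integer count."""
    text = str(value).strip().lower()
    suffix = text[-1:]
    if suffix in SUFFIX_MULTIPLIERS:
        number, multiplier = text[:-1], SUFFIX_MULTIPLIERS[suffix]
    else:
        # No recognised suffix: treat the whole string as the number
        number, multiplier = text, 1
    # round() guards against floating-point results that land just
    # below the intended integer before the int conversion
    return int(round(float(number) * multiplier))


print(parse_enrollment('5.3k'))
print(parse_enrollment('3.2m'))
```

On the raw string column this could be applied as `coursera['course_students_enrolled'].map(parse_enrollment)`, which avoids evaluating the cell contents as expressions.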

# Check for duplicate rows (samples)
duplicate_samples = coursera[coursera.duplicated()]
print("Duplicate Samples:")
print(duplicate_samples)

Duplicate Samples:
Empty DataFrame
Columns: [course_title, course_organization, course_Certificate_type, course_rating, course_difficulty, course_students_enrolled]
Index: []

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

courses = coursera[['course_title',
                    'course_organization',
                    'course_Certificate_type']]
distinct_course_titles_count = courses['course_title'].nunique()
number_of_courses_offered = coursera['course_title'].count()
print()

# Define the columns to consider when looking for duplicates
columns = ['course_title']

# Find duplicates
duplicates = coursera[coursera.duplicated(subset=columns,
                                          keep=False)]

# Select and reorder the columns of interest
duplicates = duplicates[['course_title',
                         'course_Certificate_type',
                         'course_organization']]
# Print findings
print("Number of distinct courses offered:", distinct_course_titles_count)
print("Number of courses offered:", number_of_courses_offered)
duplicates.head(6)

Number of distinct courses offered: 888
Number of courses offered: 891

# Check for missing values by columns
missing_values = coursera.isnull().sum()
print("Missing Values:")
print(missing_values)

Missing Values:
course_title                0
course_organization         0
course_Certificate_type     0
course_rating               0
course_difficulty           0
course_students_enrolled    0
dtype: int64

column_dtypes = coursera.dtypes
print("Data types of each column:")
print(column_dtypes)

Data types of each column:
course_title                 object
course_organization          object
course_Certificate_type      object
course_rating               float64
course_difficulty            object
course_students_enrolled      int64
dtype: object

coursera.describe().round(2)

plt.figure(figsize=(8, 7))
plt.boxplot(coursera['course_students_enrolled'])
plt.title('Course Students Enrolled')
plt.ylabel('Number of Students Enrolled')
plt.xticks([1], ['Students Enrolled'])

# Determining the y axis values
plt.yticks(np.arange(0, 3500000, step=150000))

plt.grid(True)
plt.show()

top_courses = coursera.nlargest(10, 'course_students_enrolled')

# Bar plot
fig, ax = plt.subplots(figsize=(6, 6))
sns.barplot(x='course_students_enrolled',
            y='course_title',
            hue='course_organization',
            data=top_courses)

# Keep x-axis tick labels horizontal
plt.xticks(rotation='horizontal')

# Display grid only on x-axis
ax.grid(True, axis='x')

# Plot the top 10 courses and organizations
plt.title('Top 10 Courses by Number of Students Enrolled')
plt.xlabel('Number of Students Enrolled')
plt.ylabel('Course Title')
plt.legend(title='Organization',
           bbox_to_anchor=(1.05, 1),
           loc=2,
           borderaxespad=0.)

# Plot the ratings of the top 10 courses and organizations
plt.figure(figsize=(6, 6))
sns.scatterplot(data=top_courses,
                y='course_title',
                x='course_rating',
                hue='course_organization',
                s=100)
plt.title('Ratings of the Top 10 Courses by Enrollment')
plt.xlabel('Course Rating')
plt.ylabel('Course Title')
plt.legend(title='Course Organization',
           bbox_to_anchor=(1.05, 1),
           loc=2, borderaxespad=0.)
plt.grid(True)
plt.xlim(3, 6)
plt.show()

import warnings
# Ignore UserWarning - missing glyphs (character representations) -
# CJK (Chinese, Japanese, Korean) characters
warnings.filterwarnings('ignore', category=UserWarning)

bottom_courses = coursera.nsmallest(10, 'course_students_enrolled')

# Define a color palette for the organizations
color_palette = sns.color_palette(
    'Dark2', n_colors=len(bottom_courses['course_organization'].unique()))

# Bar plot
fig, ax = plt.subplots(figsize=(6, 6))
sns.barplot(x='course_students_enrolled',
            y='course_title',
            hue='course_organization',
            data=bottom_courses,
            palette=color_palette)

# Keep x-axis tick labels horizontal
plt.xticks(rotation='horizontal')

# Display grid only on x-axis
ax.grid(True, axis='x')

# Plot the bottom 10 courses and organizations
plt.title('Bottom 10 Courses by Number of Students Enrolled')
plt.xlabel('Number of Students Enrolled')
plt.ylabel('Course Title')
plt.legend(title='Organization',
           bbox_to_anchor=(1.05, 1),
           loc=2,
           borderaxespad=0.)

# Plot the ratings of the bottom 10 courses and organizations
plt.figure(figsize=(6, 6))
sns.scatterplot(data=bottom_courses,
                y='course_title',
                x='course_rating',
                hue='course_organization',
                palette=color_palette,
                s=100)
plt.title('Ratings of the Bottom 10 Courses by Enrollment')
plt.xlabel('Course Rating')
plt.ylabel('Course Title')
plt.legend(title='Course Organization',
           bbox_to_anchor=(1.05, 1),
           loc=2, borderaxespad=0.)
plt.grid(True)
plt.xlim(3, 6)
plt.show()

%pycodestyle_on
top = coursera['course_organization'].value_counts().nlargest(10)
# The second column holds the number of courses per organization
top = pd.DataFrame({'course_organization': top.index,
                    'course_count': top.values})

sns.set_theme(rc={'figure.figsize': (15, 5)})

ax = sns.barplot(x='course_count',
                 y='course_organization',
                 hue='course_organization',
                 data=top)
ax.set_xlabel("Number of Courses")
ax.set_ylabel("University / Company Name")
ax.set_title("Number of courses offered by Universities / Companies")

plt.show()

%pycodestyle_on
result = (
    coursera.groupby('course_organization')['course_students_enrolled']
    .sum()
    .nlargest(10)
)
result = pd.DataFrame({'course_organization': result.index,
                       'students_enrolled': result.values})

sns.set_theme(rc={'figure.figsize': (15, 5)})

ax = sns.barplot(x='students_enrolled',
                 y='course_organization',
                 hue='course_organization',
                 data=result)
ax.set_xlabel("Students Enrolled")
ax.set_ylabel("University / Company Name")
ax.set_title("Number of students enrolled by Universities / Companies")

plt.show()

%pycodestyle_on

plt.figure(figsize=(15, 8))

# First Plot:
plt.subplot(2, 2, 1)
sns.histplot(data=coursera, x='course_rating', bins=20, kde=True)
plt.title('Distribution of Course Ratings')
plt.xlabel('Course Rating')
plt.ylabel('Frequency')
plt.grid(True)
plt.subplots_adjust(wspace=0.3)

# Second Plot:
plt.subplot(2, 2, 2)
sns.histplot(data=coursera, x='course_students_enrolled', bins=20, kde=True)
plt.title('Distribution of Student Course Enrollment')
plt.xlabel('Course Enrollment')
plt.ylabel('Frequency')
plt.grid(True)

# Second row of plots
fig, axes = plt.subplots(1, 2, figsize=(13, 4))

numeric_df = coursera.select_dtypes(include=["float64", "int64"])
correlation_matrix = numeric_df.corr()

# Third Plot: Correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm",
            fmt=".2f", ax=axes[0])
axes[0].set_title("Correlation Matrix")

# Fourth Plot:
sns.scatterplot(data=coursera, x='course_students_enrolled',
                y='course_rating', alpha=0.6, ax=axes[1])
axes[1].set_xlabel("Students Enrolled")
axes[1].set_ylabel("Course Rating")
axes[1].set_title('Course Students Enrolled vs. Course Rating')

plt.tight_layout()
plt.subplots_adjust(wspace=0.3)
plt.show()
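
Enrollment is strongly right-skewed (the `describe()` output reports a median of 42,000 against a maximum of 3,200,000), so a rank-based correlation may be a useful complement to the Pearson coefficient shown in the heatmap. The following is a minimal sketch on made-up toy data, assuming only pandas. The `toy` frame and its values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy data (illustrative only): ratings rise with enrollment,
# but one course has a very large enrollment.
toy = pd.DataFrame({
    'course_students_enrolled': [1500, 17000, 42000, 99500, 3200000],
    'course_rating': [4.2, 4.5, 4.6, 4.7, 4.9],
})

# Pearson works on the raw values and is sensitive to the outlier.
pearson = toy.corr(method='pearson')

# Spearman is the Pearson correlation computed on the ranks.
spearman = toy.rank().corr(method='pearson')

print(pearson.round(2))
print(spearman.round(2))
```

On the real data the same two calls could be run on `numeric_df`. Whether the rank-based coefficient differs meaningfully from the Pearson one there is not something this sketch establishes.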

# Defining the order of the difficulty levels
order = ['Beginner', 'Intermediate', 'Mixed', 'Advanced']

plt.figure(figsize=(15, 6))

# First plot: Distribution of Course Difficulty Levels
plt.subplot(1, 2, 1)
sns.countplot(data=coursera, x='course_difficulty', order=order)
plt.title('Number of Courses by Difficulty Level')
plt.ylabel('Number of Courses')
plt.xlabel('Difficulty Level')

# Second plot: Course Ratings by Difficulty Level
plt.subplot(1, 2, 2)
boxplot = sns.boxplot(data=coursera, x='course_difficulty',
                      y='course_rating', order=order)
plt.title('Course Ratings by Difficulty Level')
plt.ylabel('Course Rating')
plt.xlabel('Difficulty Level')

# Calculate rating counts for each difficulty level
rating_counts = coursera.groupby('course_difficulty')['course_rating'].count()

# Add rating counts as text on the boxplot
for i in range(len(rating_counts)):
    boxplot.text(i, coursera['course_rating'].min() + 0.15,
                 'n=' + str(rating_counts[order[i]]),
                 horizontalalignment='center')

plt.tight_layout()
plt.subplots_adjust(wspace=0.3)  # Adjust the space between the subplots
plt.show()

# Create subplots with 1 row and 2 columns
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Ordering columns output
order = ['COURSE', 'SPECIALIZATION', 'PROFESSIONAL CERTIFICATE']

# Bar plot for distribution of certificate types
sns.countplot(data=coursera, x='course_Certificate_type',
              ax=axes[0], order=order)
axes[0].set_title('Distribution of Certificate Types')
axes[0].set_xlabel('Certificate Type')
axes[0].set_ylabel('Count')

# Boxplot for Impact of Certificate Types
sns.boxplot(data=coursera, x='course_Certificate_type',
            y='course_rating', ax=axes[1], order=order)
axes[1].set_title('Impact of Certificate Types on Ratings')
axes[1].set_xlabel('Certificate Type')
axes[1].set_ylabel('Course Rating')

rating_counts = coursera['course_Certificate_type'].value_counts()

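# Add rating counts as text on the second boxplot, looked up by label so
# each count matches the box drawn at that position in `order`
for i, certificate_type in enumerate(order):
    axes[1].text(i, coursera['course_rating'].min() + 0.15,
                 'n=' + str(rating_counts[certificate_type]),
                 horizontalalignment='center')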

plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 6))
sns.violinplot(data=coursera, x='course_difficulty',
               y='course_students_enrolled')
plt.title('Course Students Enrolled by Difficulty Level')
plt.xlabel('Course Difficulty')
plt.ylabel('Course Students Enrolled')
plt.show()

	Unnamed: 0	course_title	course_organization	course_Certificate_type	course_rating	course_difficulty	course_students_enrolled
0	134	(ISC)² Systems Security Certified Practitioner...	(ISC)²	SPECIALIZATION	4.7	Beginner	5.3k
1	743	A Crash Course in Causality: Inferring Causal...	University of Pennsylvania	COURSE	4.7	Intermediate	17k
2	874	A Crash Course in Data Science	Johns Hopkins University	COURSE	4.5	Mixed	130k
3	413	A Law Student's Toolkit	Yale University	COURSE	4.7	Mixed	91k
4	635	A Life of Happiness and Fulfillment	Indian School of Business	COURSE	4.8	Mixed	320k

	course_title	course_organization	course_Certificate_type	course_rating	course_difficulty	course_students_enrolled
course_id
134	(ISC)² Systems Security Certified Practitioner...	(ISC)²	SPECIALIZATION	4.7	Beginner	5.3k
743	A Crash Course in Causality: Inferring Causal...	University of Pennsylvania	COURSE	4.7	Intermediate	17k
874	A Crash Course in Data Science	Johns Hopkins University	COURSE	4.5	Mixed	130k
413	A Law Student's Toolkit	Yale University	COURSE	4.7	Mixed	91k
635	A Life of Happiness and Fulfillment	Indian School of Business	COURSE	4.8	Mixed	320k

	course_title	course_organization	course_Certificate_type	course_rating	course_difficulty	course_students_enrolled
course_id
134	(ISC)² Systems Security Certified Practitioner...	(ISC)²	SPECIALIZATION	4.7	Beginner	5300
743	A Crash Course in Causality: Inferring Causal...	University of Pennsylvania	COURSE	4.7	Intermediate	17000
874	A Crash Course in Data Science	Johns Hopkins University	COURSE	4.5	Mixed	130000
413	A Law Student's Toolkit	Yale University	COURSE	4.7	Mixed	91000
635	A Life of Happiness and Fulfillment	Indian School of Business	COURSE	4.8	Mixed	320000

	course_title	course_Certificate_type	course_organization
course_id
756	Developing Your Musicianship	COURSE	Berklee College of Music
205	Developing Your Musicianship	SPECIALIZATION	Berklee College of Music
181	Machine Learning	SPECIALIZATION	University of Washington
6	Machine Learning	COURSE	Stanford University
241	Marketing Digital	COURSE	Universidade de São Paulo
325	Marketing Digital	SPECIALIZATION	Universidad Austral

	course_rating	course_students_enrolled
count	891.00	891.00
mean	4.68	90552.08
std	0.16	181936.45
min	3.30	1500.00
25%	4.60	17500.00
50%	4.70	42000.00
75%	4.80	99500.00
max	5.00	3200000.00

Table of Contents

Introduction

Data Cleaning

Exploratory Data Analysis

1. Distribution of students enrolled by course title, including schools, ratings and number of courses offered (Top 10 and Bottom 10)

2. Distributions and correlation of course rating and students enrolled

3. Distributions of course difficulty levels and course ratings comparison

4. Impact on ratings and enrollments by certification type

5. Distribution of course difficulty levels and the number of students enrolled

6. Suggestions about how this analysis can be improved

Jupyter notebook setup

Data Cleaning

Renaming columns and indexing

Converting data types

Checking for duplicate records

Checking for missing values and data types

Exploratory Data Analysis

Distribution of students enrolled

1.1 - Distribution of students enrolled by course title - Top 10 schools and ratings

1.2 - Distribution of students enrolled by course title - Bottom 10 schools and ratings

1.3 - Number of courses offered by Universities / Companies (Top 10)

1.4 - Number of students enrolled by Universities / Companies (Top 10)

1.5 - Conclusion for students enrolled by course title, including schools, ratings and number of courses offered

Top 10

Bottom 10

Ratings

2.1 - Distributions and correlation of course rating and students enrolled

2.2 - Conclusion for distributions and correlation of course rating and students enrolled

3.1 - Distribution of course difficulty levels and course ratings graphs

3.2 - Conclusion for distributions of course difficulty levels and course ratings comparison

4.1 - Impact on ratings and enrollments by certification type graphs

4.2 - Conclusion for the impact on ratings and enrollments by certification type

5.1 - Distribution of course difficulty levels and the number of students enrolled graphs

5.2 - Conclusion for distribution of course difficulty levels and the number of students enrolled

Suggestions and improvements for this analysis