Pandas Practice Problems

Download DataSet :

  1. Given a dataframe containing the following columns: "Name", "Age", "City", and "Salary". How do you select all rows where the age is greater than 30 and the salary is greater than 50000?

Solution
df[(df['Age'] > 30) & (df['Salary'] > 50000)]
  1. How do you calculate the mean, median, and standard deviation of the column "Salary" in the dataframe?

Solution
mean = df['Salary'].mean()
median = df['Salary'].median()
standard_deviation = df['Salary'].std()
  1. Given a dataframe, how do you sort its values by the "Age" column in descending order?

Solution
df.dropna(axis=0, how='any')
  1. How do you group the dataframe by the "City" column and calculate the mean of the "Salary" column for each city?

Solution
df.groupby('City')['Salary'].mean()
  1. Given a dataframe containing the following columns: "Name", "Age", "City", and "Salary". How do you select all rows where the name starts with 'J'?

Solution
df[df['Name'].str.startswith('J')]
  1. How do you drop all duplicate rows in a dataframe?

Solution
df.drop_duplicates(keep='first', inplace=True)

9.How do you rename the columns in a dataframe?

Solution
df.rename(columns={'Old_Column_Name':'New_Column_Name'}, inplace=True)
  1. How do you count the number of unique values in a column?

Solution
df['City'].nunique()
  1. How do you merge two dataframes on a specific column?

Solution

merged_df = pd.merge(df1, df2, on='Column_Name')
  1. How do you create a pivot table from a dataframe?

Solution
pivot_table = df.pivot_table(index='Column1', columns='Column2', values='Column3')
  1. How do you select the first 5 rows of a dataframe?

Solution
df.head(5)
  1. How do you select the last 5 rows of a dataframe?

Solution
df.tail(5)
  1. How do you select a specific column from a dataframe and make it into a new dataframe?

Solution
new_df = df[['Column_Name']]
  1. How do you concatenate two dataframes along the rows?

Solution
concatenated_df = pd.concat([df1, df2], axis=0)
  1. How do you find the maximum and minimum values of a column in a dataframe?

Solution
max_value = df['Column_Name'].max()
min_value = df['Column_Name'].min()
  1. How do you calculate the cumulative sum of a column in a dataframe?

Solution
df['Column_Name'].cumsum()
  1. How do you find the unique values of a column in a dataframe?

Solution
df['Column_Name'].unique()
  1. How do you drop a specific column from a dataframe?

Solution

df.drop('Column_Name', axis=1, inplace=True)
  1. How do you find the number of missing values in each column of a dataframe?

Solution
df.isnull().sum()
  1. How do you fill missing values in a specific column with a specific value?

Solution
df['Column_Name'].fillna(value, inplace=True)
  1. How do you drop all rows with more than two missing values in the dataframe?

Solution
Solution:
df.dropna(thresh=len(df) - 2, axis=0, inplace=True)

Let's Do some Data Cleaning and Analysis on the Practice Data.

Data Cleaning :

Although there are many steps to clean , but lets start with dropping duplicate rows.

Solution
df = df.drop_duplicates()

Fill missing values :

Find a way to fill missing values in Columns with Reasonable values.

Solution
df['Name'].fillna(df['Name'].mode()[0], inplace=True)
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['City'].fillna(df['City'].mode()[0], inplace=True)
df['Salary'].fillna(df['Salary'].mean(), inplace=True)

Exploratory Data Analysis (EDA) :

Generate a Statistical Summary.

Solution
print(df.describe())

Create histograms for each numeric column :

Solution
df.hist()
plt.show()

Create a bar chart to compare the distribution of salary across different cities :

Solution
grouped_data = df.groupby('City')['Salary'].mean().reset_index()
plt.bar(grouped_data['City'], grouped_data['Salary'])
plt.xlabel('City')
plt.ylabel('Average Salary')
plt.show()

Group the data by city and calculate the mean salary for each city :

Solution
city_group = df.groupby('City')
mean_salary_by_city = city_group['Salary'].mean().reset_index()

Plot the mean salary for each city using a bar plot :

Solution
plt.bar(mean_salary_by_city['City'], mean_salary_by_city['Salary'])
plt.xlabel("City")
plt.ylabel("Mean Salary")
plt.title("Mean Salary by City")
plt.xticks(rotation=45)
plt.show()

Plot a scatter plot to visualize the relationship between Age and Salary :

Solution
plt.scatter(df['Age'], df['Salary'])
plt.xlabel("Age")
plt.ylabel("Salary")
plt.title("Relationship between Age and Salary")
plt.show()

Find the highest-paid person in each city :

Solution
top_paid_by_city = city_group['Salary'].max().reset_index()
print("The highest-paid person in each city:")
print(top_paid_by_city)

Find the youngest and oldest person in the data :

Solution
min_age = df['Age'].min()
max_age = df['Age'].max()
print("The youngest person in the data is", df.loc[df['Age'] == min_age, 'Name'].iloc[0], "with an age of", min_age)
print("The oldest person in the data is", df.loc[df['Age'] == max_age, 'Name'].iloc[0], "with an age of", max_age)

Group the data by age group and city and calculate the mean salary for each group and city

age_group = df.groupby(['Age Group', 'City']) mean_salary_by_age_group_and_city = age_group['Salary'].mean().reset_index()

Pivot the table to display the mean salary for each age group and city

pivot_table = pd.pivot_table(mean_salary_by_age_group_and_city, values='Salary', index='Age Group', columns='City')

print("Mean Salary by Age Group and City:")

print(pivot_table)

Plot the mean salary for each age group and city using a heatmap

sns.heatmap(pivot_table, annot=True, cmap='Blues')

plt.xlabel("City")

plt.ylabel("Age Group")

plt.title("Mean Salary by Age Group and City")

plt.show()

Calculate the top 5 most common names in the data

top_names = df['Name'].value_counts().head()

print("The top 5 most common names in the data:")

print(top_names)

Create a new column to indicate whether a person's age is below or above the average age

mean_age = df['Age'].mean()

df['Age Above Average'] = np.where(df['Age']>=mean_age, 1, 0)

Group the data by age above average and city and calculate the median salary for each group and city

age_above_average_group = df.groupby(['Age Above Average', 'City'])

median_salary_by_age_above_avg_and_city=age_above_average_group['Salary'].median().reset_index()

Pivot the table to display the median salary for each age above average group and city

pivot_table2 = pd.pivot_table(median_salary_by_age_above_average_and_city, values='Salary', index='Age Above Average', columns='City')

print("Median Salary by Age Above Average and City:")

print(pivot_table2)

Plot the median salary for each age above average group and city using a stacked bar chart

pivot_table2.plot(kind='bar', stacked=True)

plt.xlabel("Age Above Average")

plt.ylabel("Median Salary")

plt.title("Median Salary by Age Above Average and City")

plt.show()

Last updated