Aggregating Data
The mean is the average of a set of numbers, calculated by dividing their sum by the count of numbers.
The median is the middle value of the numbers when they are sorted.
In [30]:
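A minimal sketch of the setup, assuming the data lives in a CSV file named walmart.csv (the filename is illustrative); the rest of this section refers to the DataFrame as walmart:

```python
import pandas as pd

# Load the Walmart dataset and preview the first five rows
walmart = pd.read_csv("walmart.csv")  # illustrative filename
walmart.head()
```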
Out[30]:
| | Store | Type | Department | Weekly_Sales | Is_Holiday | Temperature | Fuel_Price | Unemployment | Date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A | 1 | 57258.43 | False | 62.27 | 2.719 | 7.808 | 2010-04-02 |
| 1 | 1 | A | 1 | 16333.14 | False | 80.91 | 2.669 | 7.787 | 2010-07-02 |
| 2 | 1 | A | 1 | 19403.54 | False | 46.63 | 2.561 | 8.106 | 2010-02-26 |
| 3 | 1 | A | 1 | 22517.56 | False | 49.27 | 2.708 | 7.838 | 2010-12-03 |
| 4 | 1 | A | 1 | 17596.96 | False | 66.32 | 2.808 | 7.808 | 2010-04-16 |
So, this is our Walmart store data; we will use this dataset throughout the module.
Let's find the mean and median of the Weekly_Sales column.
In [31]:
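One way to compute the mean:

```python
walmart["Weekly_Sales"].mean()
```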
Out[31]:
In [32]:
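And the median of the same column:

```python
walmart["Weekly_Sales"].median()
```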
Out[32]:
The maximum and minimum of the Date column let us see what time range our data covers.
In [33]:
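The most recent date in the data:

```python
walmart["Date"].max()
```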
Out[33]:
In [34]:
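And the earliest date:

```python
walmart["Date"].min()
```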
Out[34]:
By using .max() and .min() we learn that our dataset has data from January 2010 to December 2012.
The .agg() method
The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super efficient.
Notice that our Temperature column is in Fahrenheit. Let's write a custom function that converts temperatures from Fahrenheit to Celsius and apply it to the column through the .agg() method.
In [35]:
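A sketch of the conversion, using the standard formula C = (F - 32) * 5/9; the function name is illustrative:

```python
def fahrenheit_to_celsius(temp_f):
    """Convert temperatures from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9

# .agg() applies our custom function to the whole column
walmart["Temperature"].agg(fahrenheit_to_celsius)
```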
Out[35]:
Here, we applied a custom function with the help of the .agg() method. You can pass a list of functions to apply more than one function at once, or select a list of columns to apply the function to more than one column.
Cumulative statistics can also be helpful in tracking summary statistics over time. We'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow us to identify what the total sales were so far as well as what the highest weekly sales were so far.
Cumulative sum and cumulative max of weekly_sales
In [36]:
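A sketch of the calculation, assuming we track department 1; the variable and new column names are illustrative:

```python
# Sort chronologically so the running statistics accumulate in date order
dept_1 = walmart[walmart["Department"] == 1].sort_values("Date")

# Running total and running maximum of weekly sales
dept_1["cum_weekly_sales"] = dept_1["Weekly_Sales"].cumsum()
dept_1["cum_max_sales"] = dept_1["Weekly_Sales"].cummax()
```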
Removing duplicates is essential for getting accurate counts, because we don't want to count the same thing multiple times. In this exercise, we'll create some new DataFrames using unique values from walmart.
In [37]:
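A sketch of the deduplication; the names store_types and store_depts are illustrative, and the output below shows the head of the per-store DataFrame:

```python
# One row per store (each store has a single type)
store_types = walmart.drop_duplicates(subset=["Store", "Type"])

# One row per store/department combination
store_depts = walmart.drop_duplicates(subset=["Store", "Department"])

store_types.head()
```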
Out[37]:
| | Store | Type | Department | Weekly_Sales | Is_Holiday | Temperature | Fuel_Price | Unemployment | Date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A | 1 | 57258.43 | False | 62.27 | 2.719 | 7.808 | 2010-04-02 |
| 6946 | 2 | A | 1 | 27023.35 | False | 69.24 | 2.603 | 8.163 | 2010-10-01 |
| 13804 | 3 | B | 1 | 4238.56 | False | 82.20 | 2.669 | 7.346 | 2010-07-02 |
| 19897 | 4 | A | 1 | 36486.28 | False | 63.96 | 2.619 | 7.127 | 2010-10-01 |
| 26713 | 5 | B | 1 | 8588.14 | False | 71.10 | 2.603 | 6.768 | 2010-10-01 |
Counting is a great way to get an overview of your data and to spot information you might otherwise miss. We'll count the number of stores of each type and the number of each department number using the DataFrames we created by dropping duplicates.
Counting the number of stores of each Type
In [38]:
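Counting store types from the deduplicated DataFrame:

```python
store_types["Type"].value_counts()
```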
Out[38]:
Proportion of stores of each Type
Normalize: the normalize argument turns the counts into proportions of the total.
In [39]:
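The same count, normalized to proportions:

```python
store_types["Type"].value_counts(normalize=True)
```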
Out[39]:
In [40]:
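Likely the analogous count of department numbers, using the per-department DataFrame sketched above:

```python
store_depts["Department"].value_counts(sort=True)
```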
Out[40]:
In [41]:
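And the proportions of each department number:

```python
store_depts["Department"].value_counts(sort=True, normalize=True)
```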
Out[41]:
Pandas .groupby() groups the data according to categories and applies a function to each category. It also helps aggregate data efficiently.
Group by Type to calculate total Weekly_Sales
In [42]:
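One way to express this:

```python
walmart.groupby("Type")["Weekly_Sales"].sum()
```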
Out[42]:
Here, we group by Type and take the sum of the Weekly_Sales column.
Group by Type and Is_Holiday to calculate total Weekly_Sales
In [43]:
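Grouping by both columns at once:

```python
walmart.groupby(["Type", "Is_Holiday"])["Weekly_Sales"].sum()
```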
Out[43]:
In this case we are grouping by two columns and summarizing one. If you want to perform other summary statistics on more than one column, pass a list of those columns.
Earlier we saw that the .agg() method is useful for computing multiple statistics on multiple variables. It also works with .groupby(), that is, on grouped data.
For each store Type, aggregating Weekly_Sales to get the min, max, mean, and median
In [44]:
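A sketch of the call; the amin and amax labels in the output below suggest NumPy functions were passed to .agg() (newer pandas versions prefer the strings "min", "max", "mean", "median"):

```python
import numpy as np

walmart.groupby("Type")["Weekly_Sales"].agg([np.min, np.max, np.mean, np.median])
```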
Out[44]:
| Type | amin | amax | mean | median |
|---|---|---|---|---|
| A | -4988.94 | 381072.11 | 20079.169258 | 10096.75 |
| B | -1750.00 | 693099.36 | 12263.647825 | 6197.43 |
| C | -379.00 | 112152.35 | 9484.482310 | 1137.34 |
Notice that the minimum Weekly_Sales is negative because some stores had more returns than sales.
For each store Type, aggregating Unemployment and Fuel_Price to get min, max, mean, and median
In [45]:
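The same aggregation applied to two columns at once:

```python
walmart.groupby("Type")[["Unemployment", "Fuel_Price"]].agg(
    [np.min, np.max, np.mean, np.median]
)
```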
Out[45]:
| Type | Unemployment amin | Unemployment amax | Unemployment mean | Unemployment median | Fuel_Price amin | Fuel_Price amax | Fuel_Price mean | Fuel_Price median |
|---|---|---|---|---|---|---|---|---|
| A | 3.879 | 14.313 | 7.798863 | 7.818 | 2.472 | 4.468 | 3.343295 | 3.416 |
| B | 4.125 | 14.313 | 7.936579 | 7.874 | 2.514 | 4.468 | 3.381915 | 3.494 |
| C | 5.217 | 14.313 | 8.948222 | 8.300 | 2.514 | 4.468 | 3.363569 | 3.417 |
Pivot tables are another way of calculating grouped summary statistics. If you have ever used spreadsheets, chances are you have used pivot tables there. Let's see how to create pivot tables in pandas.
Pivoting on one variable
We'll perform calculations using .pivot_table() to replicate the calculations we performed with .groupby().
Pivoting for mean Weekly_Sales for each store Type
In [46]:
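With .pivot_table(), mean is the default aggregation function:

```python
walmart.pivot_table(values="Weekly_Sales", index="Type")
```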
Out[46]:
| Type | Weekly_Sales |
|---|---|
| A | 20079.169258 |
| B | 12263.647825 |
| C | 9484.482310 |
Pivoting for mean and median Weekly_Sales for each store Type
In [47]:
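Passing a list of functions through the aggfunc argument:

```python
walmart.pivot_table(values="Weekly_Sales", index="Type", aggfunc=[np.mean, np.median])
```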
Out[47]:
| Type | Weekly_Sales (mean) | Weekly_Sales (median) |
|---|---|---|
| A | 20079.169258 | 10096.75 |
| B | 12263.647825 | 6197.43 |
| C | 9484.482310 | 1137.34 |
Pivoting for mean Weekly_Sales by store Type and Is_Holiday
In [48]:
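Pivoting by a second variable through the columns argument:

```python
walmart.pivot_table(values="Weekly_Sales", index="Type", columns="Is_Holiday")
```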
Out[48]:
| Type | Is_Holiday = False | Is_Holiday = True |
|---|---|---|
| A | 20009.541574 | 21001.295283 |
| B | 12166.429612 | 13549.646294 |
| C | 9464.078381 | 9757.554889 |
The .pivot_table() method has several useful arguments, including fill_value and margins.
fill_value: replaces missing values with a real value.
margins: a shortcut for when we pivot by two variables but also want to summarize each of those variables separately; it adds the row and column totals of the pivot table contents.
mean of Weekly_Sales by Department and Type, fill missing values with 0
In [49]:
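Filling the empty cells with 0 via fill_value:

```python
walmart.pivot_table(
    values="Weekly_Sales", index="Department", columns="Type", fill_value=0
)
```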
Out[49]:
| Department | A | B | C |
|---|---|---|---|
| 1 | 23280.657103 | 17821.729825 | 8955.170884 |
| 2 | 51935.206120 | 43359.816952 | 14398.908941 |
| 3 | 14044.489715 | 12880.306941 | 803.546661 |
| 4 | 33128.245264 | 21457.823086 | 13506.047051 |
| 5 | 26589.677249 | 21109.894917 | 766.519121 |
| ... | ... | ... | ... |
| 95 | 97473.593714 | 41428.486751 | 50990.968371 |
| 96 | 19771.434235 | 4874.171608 | 15720.388253 |
| 97 | 21937.066379 | 3726.370891 | 13468.309357 |
| 98 | 10971.473930 | 317.638051 | 5406.640087 |
| 99 | 486.289516 | 21.008889 | 3.332143 |

81 rows × 3 columns
mean Weekly_Sales by Department and Type, filling missing values with 0 and adding All margins for the rows and columns
In [50]:
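Adding the All row and column with margins=True:

```python
walmart.pivot_table(
    values="Weekly_Sales", index="Department", columns="Type",
    fill_value=0, margins=True
)
```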
Out[50]:
| Department | A | B | C | All |
|---|---|---|---|---|
| 1 | 23280.657103 | 17821.729825 | 8955.170884 | 19304.958040 |
| 2 | 51935.206120 | 43359.816952 | 14398.908941 | 43558.699406 |
| 3 | 14044.489715 | 12880.306941 | 803.546661 | 11880.837545 |
| 4 | 33128.245264 | 21457.823086 | 13506.047051 | 26111.912583 |
| 5 | 26589.677249 | 21109.894917 | 766.519121 | 21218.756123 |
| ... | ... | ... | ... | ... |
| 96 | 19771.434235 | 4874.171608 | 15720.388253 | 15212.524956 |
| 97 | 21937.066379 | 3726.370891 | 13468.309357 | 14291.078869 |
| 98 | 10971.473930 | 317.638051 | 5406.640087 | 6804.735435 |
| 99 | 486.289516 | 21.008889 | 3.332143 | 468.071664 |
| All | 20079.169258 | 12263.647825 | 9484.482310 | 15983.429692 |

82 rows × 4 columns