Learning Python
This post is a continuation of a learning python. In this post, I will extend the last post and focus on tidyverse equivalents to:
- Mutate columns
- Summarize data (Groupby)
- Pipe functions (%>%)
Setting up Python in Rmarkdown
Load Modules and Data
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-28/winemag-data-130k-v2.csv'
dat = pd.read_csv(url, index_col=0)
Settings for better output
This blog is built through the blogdown package and uses Rmarkdown files that render to HTML documents. Rmarkdown files best support R output but I do not know how to best format python outputs. I get inconsistent syntax highlighting.
# Stop Python from truncating number of columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('precision', 2)
Mutate
The assign() function appears to be a clean way to create new variables.
dat.assign(value = dat.points / dat.price).head(5)
# country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery value
# 0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia NaN
# 1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos 5.80
# 2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm 6.21
# 3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian 6.69
# 4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks 1.34
Summarize (Groupby)
Grouping and summarizing variables is straightforward as well. You can select which variables to group through the groupby() function and then perform different summary statistics. There is an agg() function to indicate which aggregations to perform. You can also use numpy functions directly like sum() to your grouped data. Lastly, you can use a dictionary style of input to summarize your data. I think this dictionary style format is very clean and come in handing when I need to perform various calculations on grouped data with different data types.
dat.groupby('country').agg('mean')
# points price
# country
# Argentina 86.71 24.51
# Armenia 87.50 14.50
# Australia 88.58 35.44
# Austria 90.10 30.76
# Bosnia and Herzegovina 86.50 12.50
# Brazil 84.67 23.77
# Bulgaria 87.94 14.65
# Canada 89.37 35.71
# Chile 86.49 20.79
# China 89.00 18.00
# Croatia 87.22 25.45
# Cyprus 87.18 16.27
# Czech Republic 87.25 24.25
# Egypt 84.00 NaN
# England 91.58 51.68
# France 88.85 41.14
# Georgia 87.69 19.32
# Germany 89.85 42.26
# Greece 87.28 22.36
# Hungary 89.19 40.65
# India 90.22 13.33
# Israel 88.47 31.77
# Italy 88.56 39.66
# Lebanon 87.69 30.69
# Luxembourg 88.67 23.33
# Macedonia 86.83 15.58
# Mexico 85.26 26.79
# Moldova 87.20 16.75
# Morocco 88.57 19.50
# New Zealand 88.30 26.93
# Peru 83.56 18.06
# Portugal 88.25 26.22
# Romania 86.40 15.24
# Serbia 87.50 24.50
# Slovakia 87.00 16.00
# Slovenia 88.07 24.81
# South Africa 88.06 24.67
# Spain 87.29 28.22
# Switzerland 88.57 85.29
# Turkey 88.09 24.63
# US 88.56 36.57
# Ukraine 84.07 9.21
# Uruguay 86.75 26.40
dat.groupby('country').sum()
# points price
# country
# Argentina 329499 9.21e+04
# Armenia 175 2.90e+01
# Australia 206304 8.13e+04
# Austria 301389 8.61e+04
# Bosnia and Herzegovina 173 2.50e+01
# Brazil 4403 1.12e+03
# Bulgaria 12399 2.06e+03
# Canada 22968 9.07e+03
# Chile 386799 9.18e+04
# China 89 1.80e+01
# Croatia 6367 1.81e+03
# Cyprus 959 1.79e+02
# Czech Republic 1047 2.91e+02
# Egypt 84 0.00e+00
# England 6777 3.57e+03
# France 1962855 7.31e+05
# Georgia 7541 1.62e+03
# Germany 194529 8.96e+04
# Greece 40674 1.03e+04
# Hungary 13022 5.89e+03
# India 812 1.20e+02
# Israel 44678 1.55e+04
# Italy 1730506 6.71e+05
# Lebanon 3069 1.07e+03
# Luxembourg 532 1.40e+02
# Macedonia 1042 1.87e+02
# Mexico 5968 1.88e+03
# Moldova 5145 9.88e+02
# Morocco 2480 5.46e+02
# New Zealand 125302 3.71e+04
# Peru 1337 2.89e+02
# Portugal 502232 1.28e+05
# Romania 10368 1.83e+03
# Serbia 1050 2.94e+02
# Slovakia 87 1.60e+01
# Slovenia 7662 1.98e+03
# South Africa 123367 3.19e+04
# Spain 580031 1.85e+05
# Switzerland 620 5.97e+02
# Turkey 7928 2.22e+03
# US 4827077 1.98e+06
# Ukraine 1177 1.29e+02
# Uruguay 9456 2.88e+03
dat.groupby('country').agg(['mean', 'sum']) # Multiple summary statistics
# points price
# mean sum mean sum
# country
# Argentina 86.71 329499 24.51 9.21e+04
# Armenia 87.50 175 14.50 2.90e+01
# Australia 88.58 206304 35.44 8.13e+04
# Austria 90.10 301389 30.76 8.61e+04
# Bosnia and Herzegovina 86.50 173 12.50 2.50e+01
# Brazil 84.67 4403 23.77 1.12e+03
# Bulgaria 87.94 12399 14.65 2.06e+03
# Canada 89.37 22968 35.71 9.07e+03
# Chile 86.49 386799 20.79 9.18e+04
# China 89.00 89 18.00 1.80e+01
# Croatia 87.22 6367 25.45 1.81e+03
# Cyprus 87.18 959 16.27 1.79e+02
# Czech Republic 87.25 1047 24.25 2.91e+02
# Egypt 84.00 84 NaN 0.00e+00
# England 91.58 6777 51.68 3.57e+03
# France 88.85 1962855 41.14 7.31e+05
# Georgia 87.69 7541 19.32 1.62e+03
# Germany 89.85 194529 42.26 8.96e+04
# Greece 87.28 40674 22.36 1.03e+04
# Hungary 89.19 13022 40.65 5.89e+03
# India 90.22 812 13.33 1.20e+02
# Israel 88.47 44678 31.77 1.55e+04
# Italy 88.56 1730506 39.66 6.71e+05
# Lebanon 87.69 3069 30.69 1.07e+03
# Luxembourg 88.67 532 23.33 1.40e+02
# Macedonia 86.83 1042 15.58 1.87e+02
# Mexico 85.26 5968 26.79 1.88e+03
# Moldova 87.20 5145 16.75 9.88e+02
# Morocco 88.57 2480 19.50 5.46e+02
# New Zealand 88.30 125302 26.93 3.71e+04
# Peru 83.56 1337 18.06 2.89e+02
# Portugal 88.25 502232 26.22 1.28e+05
# Romania 86.40 10368 15.24 1.83e+03
# Serbia 87.50 1050 24.50 2.94e+02
# Slovakia 87.00 87 16.00 1.60e+01
# Slovenia 88.07 7662 24.81 1.98e+03
# South Africa 88.06 123367 24.67 3.19e+04
# Spain 87.29 580031 28.22 1.85e+05
# Switzerland 88.57 620 85.29 5.97e+02
# Turkey 88.09 7928 24.63 2.22e+03
# US 88.56 4827077 36.57 1.98e+06
# Ukraine 84.07 1177 9.21 1.29e+02
# Uruguay 86.75 9456 26.40 2.88e+03
dat.groupby('country').agg({'price': 'mean'}) # With a dictionary
# price
# country
# Argentina 24.51
# Armenia 14.50
# Australia 35.44
# Austria 30.76
# Bosnia and Herzegovina 12.50
# Brazil 23.77
# Bulgaria 14.65
# Canada 35.71
# Chile 20.79
# China 18.00
# Croatia 25.45
# Cyprus 16.27
# Czech Republic 24.25
# Egypt NaN
# England 51.68
# France 41.14
# Georgia 19.32
# Germany 42.26
# Greece 22.36
# Hungary 40.65
# India 13.33
# Israel 31.77
# Italy 39.66
# Lebanon 30.69
# Luxembourg 23.33
# Macedonia 15.58
# Mexico 26.79
# Moldova 16.75
# Morocco 19.50
# New Zealand 26.93
# Peru 18.06
# Portugal 26.22
# Romania 15.24
# Serbia 24.50
# Slovakia 16.00
# Slovenia 24.81
# South Africa 24.67
# Spain 28.22
# Switzerland 85.29
# Turkey 24.63
# US 36.57
# Ukraine 9.21
# Uruguay 26.40
Chaining
Python does not appear to need actual operators to chain functions as R does with %>% from tidyverse’s magrittr package. Python can take advantage of the dot notation and continually chain different functions as long as they are wrapped in their own parenthesis.
(dat.loc[:, ['country', 'variety', 'points']] # Select columns
.groupby(['country', 'variety']) # Groupby
.agg('mean') # Summary Statistic
.query('country == "US"') # Filter for US Only
.sort_values('points', ascending = False) # Arrange in descending order
.head(10)) # Show top 10
# points
# country variety
# US Syrah-Petit Verdot 94.0
# Moscato Giallo 94.0
# Riesling-Chardonnay 94.0
# Touriga Nacional Blend 93.0
# Tannat-Syrah 93.0
# Sauvignon Gris 92.0
# Malbec-Cabernet Sauvignon 92.0
# White Port 92.0
# Tempranillo-Cabernet Sauvignon 92.0
# Chardonnay-Albariño 91.5