Learning Python: Part 2

Part two of translating data tasks done in R but with Python.

Learning Python

This post is a continuation of a learning python. In this post, I will extend the last post and focus on tidyverse equivalents to:

  1. Mutate columns
  2. Summarize data (Groupby)
  3. Pipe functions (%>%)

Setting up Python in Rmarkdown

Load Modules and Data

import pandas as pd
import numpy as np

url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-28/winemag-data-130k-v2.csv'
dat = pd.read_csv(url, index_col=0)

Settings for better output

This blog is built through the blogdown package and uses Rmarkdown files that render to HTML documents. Rmarkdown files best support R output but I do not know how to best format python outputs. I get inconsistent syntax highlighting.

# Stop Python from truncating number of columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('precision', 2)

Mutate

The assign() function appears to be a clean way to create new variables.

dat.assign(value = dat.points / dat.price).head(5)
#     country                                        description                         designation  points  price           province             region_1           region_2         taster_name taster_twitter_handle                                              title         variety               winery  value
# 0     Italy  Aromas include tropical fruit, broom, brimston...                        Vulkà Bianco      87    NaN  Sicily & Sardinia                 Etna                NaN       Kerin O’Keefe          @kerinokeefe                  Nicosia 2013 Vulkà Bianco  (Etna)     White Blend              Nicosia    NaN
# 1  Portugal  This is ripe and fruity, a wine that is smooth...                            Avidagos      87   15.0              Douro                  NaN                NaN          Roger Voss            @vossroger      Quinta dos Avidagos 2011 Avidagos Red (Douro)  Portuguese Red  Quinta dos Avidagos   5.80
# 2        US  Tart and snappy, the flavors of lime flesh and...                                 NaN      87   14.0             Oregon    Willamette Valley  Willamette Valley        Paul Gregutt           @paulgwine       Rainstorm 2013 Pinot Gris (Willamette Valley)      Pinot Gris            Rainstorm   6.21
# 3        US  Pineapple rind, lemon pith and orange blossom ...                Reserve Late Harvest      87   13.0           Michigan  Lake Michigan Shore                NaN  Alexander Peartree                   NaN  St. Julian 2013 Reserve Late Harvest Riesling ...        Riesling           St. Julian   6.69
# 4        US  Much like the regular bottling from 2012, this...  Vintner's Reserve Wild Child Block      87   65.0             Oregon    Willamette Valley  Willamette Valley        Paul Gregutt           @paulgwine   Sweet Cheeks 2012 Vintner's Reserve Wild Child...      Pinot Noir         Sweet Cheeks   1.34

Summarize (Groupby)

Grouping and summarizing variables is straightforward as well. You can select which variables to group through the groupby() function and then perform different summary statistics. There is an agg() function to indicate which aggregations to perform. You can also use numpy functions directly like sum() to your grouped data. Lastly, you can use a dictionary style of input to summarize your data. I think this dictionary style format is very clean and come in handing when I need to perform various calculations on grouped data with different data types.

dat.groupby('country').agg('mean')
#                         points  price
# country                              
# Argentina                86.71  24.51
# Armenia                  87.50  14.50
# Australia                88.58  35.44
# Austria                  90.10  30.76
# Bosnia and Herzegovina   86.50  12.50
# Brazil                   84.67  23.77
# Bulgaria                 87.94  14.65
# Canada                   89.37  35.71
# Chile                    86.49  20.79
# China                    89.00  18.00
# Croatia                  87.22  25.45
# Cyprus                   87.18  16.27
# Czech Republic           87.25  24.25
# Egypt                    84.00    NaN
# England                  91.58  51.68
# France                   88.85  41.14
# Georgia                  87.69  19.32
# Germany                  89.85  42.26
# Greece                   87.28  22.36
# Hungary                  89.19  40.65
# India                    90.22  13.33
# Israel                   88.47  31.77
# Italy                    88.56  39.66
# Lebanon                  87.69  30.69
# Luxembourg               88.67  23.33
# Macedonia                86.83  15.58
# Mexico                   85.26  26.79
# Moldova                  87.20  16.75
# Morocco                  88.57  19.50
# New Zealand              88.30  26.93
# Peru                     83.56  18.06
# Portugal                 88.25  26.22
# Romania                  86.40  15.24
# Serbia                   87.50  24.50
# Slovakia                 87.00  16.00
# Slovenia                 88.07  24.81
# South Africa             88.06  24.67
# Spain                    87.29  28.22
# Switzerland              88.57  85.29
# Turkey                   88.09  24.63
# US                       88.56  36.57
# Ukraine                  84.07   9.21
# Uruguay                  86.75  26.40
dat.groupby('country').sum()
#                          points     price
# country                                  
# Argentina                329499  9.21e+04
# Armenia                     175  2.90e+01
# Australia                206304  8.13e+04
# Austria                  301389  8.61e+04
# Bosnia and Herzegovina      173  2.50e+01
# Brazil                     4403  1.12e+03
# Bulgaria                  12399  2.06e+03
# Canada                    22968  9.07e+03
# Chile                    386799  9.18e+04
# China                        89  1.80e+01
# Croatia                    6367  1.81e+03
# Cyprus                      959  1.79e+02
# Czech Republic             1047  2.91e+02
# Egypt                        84  0.00e+00
# England                    6777  3.57e+03
# France                  1962855  7.31e+05
# Georgia                    7541  1.62e+03
# Germany                  194529  8.96e+04
# Greece                    40674  1.03e+04
# Hungary                   13022  5.89e+03
# India                       812  1.20e+02
# Israel                    44678  1.55e+04
# Italy                   1730506  6.71e+05
# Lebanon                    3069  1.07e+03
# Luxembourg                  532  1.40e+02
# Macedonia                  1042  1.87e+02
# Mexico                     5968  1.88e+03
# Moldova                    5145  9.88e+02
# Morocco                    2480  5.46e+02
# New Zealand              125302  3.71e+04
# Peru                       1337  2.89e+02
# Portugal                 502232  1.28e+05
# Romania                   10368  1.83e+03
# Serbia                     1050  2.94e+02
# Slovakia                     87  1.60e+01
# Slovenia                   7662  1.98e+03
# South Africa             123367  3.19e+04
# Spain                    580031  1.85e+05
# Switzerland                 620  5.97e+02
# Turkey                     7928  2.22e+03
# US                      4827077  1.98e+06
# Ukraine                    1177  1.29e+02
# Uruguay                    9456  2.88e+03
dat.groupby('country').agg(['mean', 'sum']) # Multiple summary statistics
#                        points           price          
#                          mean      sum   mean       sum
# country                                                
# Argentina               86.71   329499  24.51  9.21e+04
# Armenia                 87.50      175  14.50  2.90e+01
# Australia               88.58   206304  35.44  8.13e+04
# Austria                 90.10   301389  30.76  8.61e+04
# Bosnia and Herzegovina  86.50      173  12.50  2.50e+01
# Brazil                  84.67     4403  23.77  1.12e+03
# Bulgaria                87.94    12399  14.65  2.06e+03
# Canada                  89.37    22968  35.71  9.07e+03
# Chile                   86.49   386799  20.79  9.18e+04
# China                   89.00       89  18.00  1.80e+01
# Croatia                 87.22     6367  25.45  1.81e+03
# Cyprus                  87.18      959  16.27  1.79e+02
# Czech Republic          87.25     1047  24.25  2.91e+02
# Egypt                   84.00       84    NaN  0.00e+00
# England                 91.58     6777  51.68  3.57e+03
# France                  88.85  1962855  41.14  7.31e+05
# Georgia                 87.69     7541  19.32  1.62e+03
# Germany                 89.85   194529  42.26  8.96e+04
# Greece                  87.28    40674  22.36  1.03e+04
# Hungary                 89.19    13022  40.65  5.89e+03
# India                   90.22      812  13.33  1.20e+02
# Israel                  88.47    44678  31.77  1.55e+04
# Italy                   88.56  1730506  39.66  6.71e+05
# Lebanon                 87.69     3069  30.69  1.07e+03
# Luxembourg              88.67      532  23.33  1.40e+02
# Macedonia               86.83     1042  15.58  1.87e+02
# Mexico                  85.26     5968  26.79  1.88e+03
# Moldova                 87.20     5145  16.75  9.88e+02
# Morocco                 88.57     2480  19.50  5.46e+02
# New Zealand             88.30   125302  26.93  3.71e+04
# Peru                    83.56     1337  18.06  2.89e+02
# Portugal                88.25   502232  26.22  1.28e+05
# Romania                 86.40    10368  15.24  1.83e+03
# Serbia                  87.50     1050  24.50  2.94e+02
# Slovakia                87.00       87  16.00  1.60e+01
# Slovenia                88.07     7662  24.81  1.98e+03
# South Africa            88.06   123367  24.67  3.19e+04
# Spain                   87.29   580031  28.22  1.85e+05
# Switzerland             88.57      620  85.29  5.97e+02
# Turkey                  88.09     7928  24.63  2.22e+03
# US                      88.56  4827077  36.57  1.98e+06
# Ukraine                 84.07     1177   9.21  1.29e+02
# Uruguay                 86.75     9456  26.40  2.88e+03
dat.groupby('country').agg({'price': 'mean'}) # With a dictionary
#                         price
# country                      
# Argentina               24.51
# Armenia                 14.50
# Australia               35.44
# Austria                 30.76
# Bosnia and Herzegovina  12.50
# Brazil                  23.77
# Bulgaria                14.65
# Canada                  35.71
# Chile                   20.79
# China                   18.00
# Croatia                 25.45
# Cyprus                  16.27
# Czech Republic          24.25
# Egypt                     NaN
# England                 51.68
# France                  41.14
# Georgia                 19.32
# Germany                 42.26
# Greece                  22.36
# Hungary                 40.65
# India                   13.33
# Israel                  31.77
# Italy                   39.66
# Lebanon                 30.69
# Luxembourg              23.33
# Macedonia               15.58
# Mexico                  26.79
# Moldova                 16.75
# Morocco                 19.50
# New Zealand             26.93
# Peru                    18.06
# Portugal                26.22
# Romania                 15.24
# Serbia                  24.50
# Slovakia                16.00
# Slovenia                24.81
# South Africa            24.67
# Spain                   28.22
# Switzerland             85.29
# Turkey                  24.63
# US                      36.57
# Ukraine                  9.21
# Uruguay                 26.40

Chaining

Python does not appear to need actual operators to chain functions as R does with %>% from tidyverse’s magrittr package. Python can take advantage of the dot notation and continually chain different functions as long as they are wrapped in their own parenthesis.

(dat.loc[:, ['country', 'variety', 'points']]  # Select columns
    .groupby(['country', 'variety'])           # Groupby
    .agg('mean')                               # Summary Statistic
    .query('country == "US"')                  # Filter for US Only
    .sort_values('points', ascending = False)  # Arrange in descending order
    .head(10))                                 # Show top 10  
#                                         points
# country variety                               
# US      Syrah-Petit Verdot                94.0
#         Moscato Giallo                    94.0
#         Riesling-Chardonnay               94.0
#         Touriga Nacional Blend            93.0
#         Tannat-Syrah                      93.0
#         Sauvignon Gris                    92.0
#         Malbec-Cabernet Sauvignon         92.0
#         White Port                        92.0
#         Tempranillo-Cabernet Sauvignon    92.0
#         Chardonnay-Albariño               91.5
 
comments powered by Disqus