Learning Python
I often see a debate online between Python and R as programming languages for Data Science. I am not interested in continuing that debate here, but rather just want to continue to learn other programming languages and have another tool available to use. I often read online Python is better for general use programming while R was developed with statisticians in mind. If that is the case I think it would be helpful to start learning Python and see where I can use it for my advantage.
In this post, I will just try to practice my typical R tidyverse workflow of:
- Get Data
- Load needed libraries
- Select and filter columns.
Set up Python in Rmarkdown
The reticulate package helps integrate Python into your rmarkdown scripts. After loading the package, you are suggested to specify which environment to use Python. I am still confused on how exactly to do this but I found the following code to be most stable to start my script especially if I have to leave and come back later to finish it.
library(reticulate)
use_virtualenv("r-reticulate")
Load Modules and Data
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-28/winemag-data-130k-v2.csv'
dat = pd.read_csv(url, index_col=0)
# Stop Python from truncating number of columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('precision', 2)
Previewing the Data
View Data in Rstudio
View(py$dat)
Understanding the Big Picture of a Dataset
dat.info() # General Info (i.e. columns, non-null entries, data types)
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 129971 entries, 0 to 129970
# Data columns (total 13 columns):
# country 129908 non-null object
# description 129971 non-null object
# designation 92506 non-null object
# points 129971 non-null int64
# price 120975 non-null float64
# province 129908 non-null object
# region_1 108724 non-null object
# region_2 50511 non-null object
# taster_name 103727 non-null object
# taster_twitter_handle 98758 non-null object
# title 129971 non-null object
# variety 129970 non-null object
# winery 129971 non-null object
# dtypes: float64(1), int64(1), object(11)
# memory usage: 13.9+ MB
dat.head(5) # View top 5 rows of the data
# country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
# 0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
# 1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
# 2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
# 3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
# 4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
dat.price.describe() # Describe basic statistics for numeric columns
# count 120975.00
# mean 35.36
# std 41.02
# min 4.00
# 25% 17.00
# 50% 25.00
# 75% 42.00
# max 3300.00
# Name: price, dtype: float64
dat.country.describe() # Describe non-numeric (strings) columns
# count 129908
# unique 43
# top US
# freq 54504
# Name: country, dtype: object
The describe() function returns certain information depending on the column type selected (i.e. integers or strings). However, using describe() without selecting any column prior will run describe() on all numeric data types only.
Selecting Columns
# How to find index of a column, using region_2 as example
dat.columns.get_loc('region_2')
# 7
dat.country.head(5) #Select column by name
# 0 Italy
# 1 Portugal
# 2 US
# 3 US
# 4 US
# Name: country, dtype: object
dat['country'].head(5) #Alternative to select column by name
# 0 Italy
# 1 Portugal
# 2 US
# 3 US
# 4 US
# Name: country, dtype: object
dat.iloc[:, 0]. head(5) #Select column by Index location
# 0 Italy
# 1 Portugal
# 2 US
# 3 US
# 4 US
# Name: country, dtype: object
dat.iloc[:, [4, 6, 10]].head(5) # Select multiple columns
# price region_1 title
# 0 NaN Etna Nicosia 2013 Vulkà Bianco (Etna)
# 1 15.0 NaN Quinta dos Avidagos 2011 Avidagos Red (Douro)
# 2 14.0 Willamette Valley Rainstorm 2013 Pinot Gris (Willamette Valley)
# 3 13.0 Lake Michigan Shore St. Julian 2013 Reserve Late Harvest Riesling ...
# 4 65.0 Willamette Valley Sweet Cheeks 2012 Vintner's Reserve Wild Child...
dat.iloc[:, 1:4].head(5)# Select slice of columns
# description designation points
# 0 Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87
# 1 This is ripe and fruity, a wine that is smooth... Avidagos 87
# 2 Tart and snappy, the flavors of lime flesh and... NaN 87
# 3 Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87
# 4 Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87
dat.iloc[:, np.r_[2:4, 9:12]].head(5) # Combine Multiple and Slice
# designation points taster_twitter_handle title variety
# 0 Vulkà Bianco 87 @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend
# 1 Avidagos 87 @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red
# 2 NaN 87 @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris
# 3 Reserve Late Harvest 87 NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling
# 4 Vintner's Reserve Wild Child Block 87 @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir
Filter Rows
dat[dat['country'] == "US"].count() # Filtering String Columns
# country 54504
# description 54504
# designation 36908
# points 54504
# price 54265
# province 54504
# region_1 54226
# region_2 50511
# taster_name 37730
# taster_twitter_handle 34741
# title 54504
# variety 54504
# winery 54504
# dtype: int64
dat[dat['price'] > 500].count() # Filtering Strings
# country 91
# description 91
# designation 47
# points 91
# price 91
# province 91
# region_1 75
# region_2 5
# taster_name 87
# taster_twitter_handle 80
# title 91
# variety 91
# winery 91
# dtype: int64
dat.query('price > 500').count() # Filter by query
# country 91
# description 91
# designation 47
# points 91
# price 91
# province 91
# region_1 75
# region_2 5
# taster_name 87
# taster_twitter_handle 80
# title 91
# variety 91
# winery 91
# dtype: int64
dat.query('(price > 500) & (country == "US")').count() # Filter by more than 1 query
# country 5
# description 5
# designation 5
# points 5
# price 5
# province 5
# region_1 5
# region_2 5
# taster_name 3
# taster_twitter_handle 3
# title 5
# variety 5
# winery 5
# dtype: int64