Pandas: Adding/Modifying Columns

Example 1: Lowercasing A Column Of Strings

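  • Email addresses are effectively case-insensitive (the domain part always is; the local part is treated that way by virtually all providers)

  • The dataset has them in mixed case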

import pandas as pd

persons = pd.DataFrame({
    'firstname': ['Joerg',                  'Johanna',           'Caro',              'Philipp'          ],
    'lastname':  ['Faschingbauer',          'Faschingbauer',     'Faschingbauer',     'Lichtenberger'    ],
    'email':     ['JF@faschingbauer.co.at', 'Johanna@email.com', 'Caro@email.com',    'PHILIPP@email.com'],
    'age':       [56,                       27,                  25,                  37                 ],
})
persons['email']
0    JF@faschingbauer.co.at
1         Johanna@email.com
2            Caro@email.com
3         PHILIPP@email.com
Name: email, dtype: object

Example 1: Modifying The email Column

  • Pull out email

    email = persons['email']
    
  • Lowercase it, using the vectorized string methods of Series (the .str accessor)

    email.str.lower()
    
    0    jf@faschingbauer.co.at
    1         johanna@email.com
    2            caro@email.com
    3         philipp@email.com
    Name: email, dtype: object
    
    lower_email = email.str.lower()
    
  • Assign back into persons DataFrame

    persons['email'] = lower_email
    
    persons
    
       firstname       lastname                   email  age
    0      Joerg  Faschingbauer  jf@faschingbauer.co.at   56
    1    Johanna  Faschingbauer       johanna@email.com   27
    2       Caro  Faschingbauer          caro@email.com   25
    3    Philipp  Lichtenberger       philipp@email.com   37

  • In short

    persons['email'] = persons['email'].str.lower()
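
Why the assignment is needed: .str.lower() returns a new Series and leaves the DataFrame alone. A minimal sketch on a cut-down version of the dataset (the variable names lowered, unchanged and changed are made up for this illustration):

```python
import pandas as pd

# Two rows of the example dataset are enough to show the effect
persons = pd.DataFrame({
    'firstname': ['Joerg', 'Philipp'],
    'email':     ['JF@faschingbauer.co.at', 'PHILIPP@email.com'],
})

# .str.lower() builds a new Series; the column in persons is still mixed case
lowered = persons['email'].str.lower()
unchanged = persons['email'].tolist()

# Only the assignment replaces the column in the DataFrame
persons['email'] = lowered
changed = persons['email'].tolist()

print(unchanged)
print(changed)
```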
    

Example 2: Adding A normalized_email Column

import pandas as pd

persons = pd.DataFrame({
    'firstname': ['Guido',      'Joerg',                  'Johanna',        'Caro',              'Philipp'],
    'lastname':  ['Rentner',    'Faschingbauer',          'Faschingbauer',  'Faschingbauer',     'Lichtenberger'],
    'email':     ['jf@old.com', 'JF@faschingbauer.co.at', 'Caro@email.com', 'Johanna@email.com', 'PHILIPP@email.com'],
    'age':       [69,           56,                       27,               25,                  37],
})

  • It’s as simple as assigning to a column that does not yet exist

    persons['normalized_email'] = persons['email'].str.lower()
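
A quick way to check what the assignment did: the new column is appended at the end, and the original email column keeps its mixed case. This is only a sketch on two rows of the dataset; the helper variables columns, original and normalized are not part of the example above.

```python
import pandas as pd

persons = pd.DataFrame({
    'firstname': ['Guido', 'Joerg'],
    'email':     ['jf@old.com', 'JF@faschingbauer.co.at'],
})

# Assigning to a key that does not exist yet creates the column
persons['normalized_email'] = persons['email'].str.lower()

columns = list(persons.columns)                     # new column comes last
original = persons['email'].tolist()                # untouched
normalized = persons['normalized_email'].tolist()   # lowercased copy

print(columns)
print(original)
print(normalized)
```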
    

What If No Prebuilt Functionality Exists? apply() To The Rescue!

  • Simple example: Python’s built-in len() function takes one parameter and returns one value

    s = 'Hello'
    len(s)
    
    5
    
  • Apply that to a Series, e.g. firstname

    fn = persons['firstname']
    fn
    
    0      Guido
    1      Joerg
    2    Johanna
    3       Caro
    4    Philipp
    Name: firstname, dtype: object
    
  • Length of each firstname

    fn.apply(len)
    
    0    5
    1    5
    2    7
    3    4
    4    7
    Name: firstname, dtype: int64
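
For this particular case prebuilt functionality does exist: the .str accessor has its own len() method. The sketch below compares both routes on the same data; len() only serves as an easy stand-in for functions that have no vectorized counterpart. The names via_apply and via_str are made up here.

```python
import pandas as pd

fn = pd.Series(['Guido', 'Joerg', 'Johanna', 'Caro', 'Philipp'],
               name='firstname')

# Route 1: call Python's len() once per element
via_apply = fn.apply(len)

# Route 2: the vectorized string method of the same name
via_str = fn.str.len()

print(via_apply.tolist())
print(via_str.tolist())
```

One practical difference to keep in mind: with missing values in the Series, len() raises a TypeError on them, while .str.len() passes them through as NaN.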
    

apply()-ing Custom Functions

  • Write a single-parameter function (just like len())

    def is_palindrome(s):
        s = s.lower()
        return s == s[::-1]
    
    persons
    
       firstname       lastname                   email  age        normalized_email
    0      Guido        Rentner              jf@old.com   69              jf@old.com
    1      Joerg  Faschingbauer  JF@faschingbauer.co.at   56  jf@faschingbauer.co.at
    2    Johanna  Faschingbauer          Caro@email.com   27          caro@email.com
    3       Caro  Faschingbauer       Johanna@email.com   25       johanna@email.com
    4    Philipp  Lichtenberger       PHILIPP@email.com   37       philipp@email.com

  • Apply it

    persons['lastname'].apply(is_palindrome)
    
    0     True
    1    False
    2    False
    3    False
    4    False
    Name: lastname, dtype: bool
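
Tying this back to adding columns: the result of apply() is an ordinary Series, so it can be assigned like any other. A minimal sketch on three rows of the dataset; the column name lastname_is_palindrome and the variable palindromes are hypothetical, chosen only for this illustration.

```python
import pandas as pd

def is_palindrome(s):
    s = s.lower()
    return s == s[::-1]

persons = pd.DataFrame({
    'firstname': ['Guido', 'Joerg', 'Philipp'],
    'lastname':  ['Rentner', 'Faschingbauer', 'Lichtenberger'],
})

# Store the boolean result of apply() as a new column
persons['lastname_is_palindrome'] = persons['lastname'].apply(is_palindrome)

# A boolean column can also be used directly to select rows
palindromes = persons[persons['lastname_is_palindrome']]

print(persons)
print(palindromes['firstname'].tolist())
```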