Pandas: Basics (DataFrame And Series)

Naive: Objects, And Collections Of Objects

  • Person object, represented as naive dictionary in Python

joerg = {
    'firstname': 'Joerg',
    'lastname': 'Faschingbauer',
    'email': 'jf@faschingbauer.co.at',
    'age': 56,
}
  • Again, naive collection of persons: native Python list

caro = {
    'firstname': 'Caro',
    'lastname': 'Faschingbauer',
    'email': 'caro@email.com',
    'age': 25,
}
persons = [joerg, caro]
persons
[{'firstname': 'Joerg',
  'lastname': 'Faschingbauer',
  'email': 'jf@faschingbauer.co.at',
  'age': 56},
 {'firstname': 'Caro',
  'lastname': 'Faschingbauer',
  'email': 'caro@email.com',
  'age': 25}]
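
With this row-oriented layout, anything column-like needs an explicit loop over all persons. A minimal sketch, using shortened versions of the records above (the names `ages` and `total_age` are illustrative and not part of the original material):

```python
# Row-oriented layout: one dictionary per person (shortened records)
persons = [
    {'firstname': 'Joerg', 'age': 56},
    {'firstname': 'Caro', 'age': 25},
]

# "Selecting a column" means visiting every row and picking one key
ages = [person['age'] for person in persons]
print(ages)        # [56, 25]

# Aggregation works the same way, row by row
total_age = sum(person['age'] for person in persons)
print(total_age)   # 81
```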

Inverted: Objects, And Collections Of Objects (⟶ DataFrame)

  • A pandas DataFrame is different

  • … analogous to a dictionary that contains database columns

persons = {
    'firstname': ['Joerg',                  'Johanna',           'Caro',              'Philipp'          ],
    'lastname':  ['Faschingbauer',          'Faschingbauer',     'Faschingbauer',     'Lichtenberger'    ],
    'email':     ['jf@faschingbauer.co.at', 'johanna@email.com', 'caro@email.com',    'philipp@email.com'],
    'age':       [56,                       27,                  25,                  37                 ],
}
  • Operation: column selection

persons['firstname']
['Joerg', 'Johanna', 'Caro', 'Philipp']
persons['age']
[56, 27, 25, 37]
  • Operation: aggregation

sum(persons['age'])
145
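
The column-oriented layout makes the opposite operation clumsy: a single person has to be reassembled from all columns. A sketch, assuming all lists have the same length (the helper `row` is hypothetical and only serves as illustration):

```python
persons = {
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'age':       [56, 27, 25, 37],
}

def row(columns, i):
    """Collect element i of every column into one dictionary."""
    return {name: values[i] for name, values in columns.items()}

print(row(persons, 2))   # {'firstname': 'Caro', 'age': 25}

# Another aggregation on a plain list: the average age
average_age = sum(persons['age']) / len(persons['age'])
print(average_age)       # 36.25
```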

Enter pandas, DataFrame, Series

import pandas as pd
  • Native Python dictionaries are not efficient enough

  • Native Python dictionaries are not feature-rich enough

  • A plain Python list allows mixed data types inside one column

  • Pandas uses NumPy internally ⟶ values inside one column (Series) have same type

persons = pd.DataFrame(persons)
persons
  firstname       lastname                   email  age
0     Joerg  Faschingbauer  jf@faschingbauer.co.at   56
1   Johanna  Faschingbauer       johanna@email.com   27
2      Caro  Faschingbauer          caro@email.com   25
3   Philipp  Lichtenberger       philipp@email.com   37
  • Note the index column

persons.shape
(4, 4)
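
That every column has one type can be checked through the dtypes attribute. A minimal sketch on a shortened DataFrame; the names `age_is_int` and `mixed` are illustrative, and the dtype reported for string columns depends on the installed pandas version, so no output is shown for it:

```python
import pandas as pd

persons = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'age':       [56, 27, 25, 37],
})

# One dtype per column
print(persons.dtypes)

# 'age' is stored with an integer dtype
age_is_int = pd.api.types.is_integer_dtype(persons['age'])
print(age_is_int)     # True

# Mixing types in one column falls back to the generic 'object' dtype
mixed = pd.Series([56, 'twenty-seven', 25.0])
print(mixed.dtype)    # object
```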

Selecting A Column ⟶ Series

  • Just like a Python dictionary: index operator []

persons.columns
Index(['firstname', 'lastname', 'email', 'age'], dtype='object')
persons['firstname']
0      Joerg
1    Johanna
2       Caro
3    Philipp
Name: firstname, dtype: object
type(persons['firstname'])
pandas.core.series.Series
persons['firstname'].iloc[0]
'Joerg'
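
A Series carries its own aggregation methods, so the sum(persons['age']) from the dictionary version becomes a method call. A short sketch on a reduced DataFrame (the variable `ages` is illustrative):

```python
import pandas as pd

persons = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'age':       [56, 27, 25, 37],
})

ages = persons['age']        # a Series
print(ages.sum())            # 145
print(ages.mean())           # 36.25
print(ages.max())            # 56

# Element-wise operations return a new Series of the same length
print((ages > 30).tolist())  # [True, False, False, True]
```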

Selecting Multiple Columns

  • Unlike Python dictionary: using index operator with a list of column names

persons[['firstname', 'age']]
  firstname  age
0     Joerg   56
1   Johanna   27
2      Caro   25
3   Philipp   37
type(persons[['firstname', 'age']])
pandas.core.frame.DataFrame
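
The result type depends on what goes into the index operator, not on how many columns come out. A small sketch on a reduced DataFrame (`as_series`, `as_frame` and `reordered` are illustrative names):

```python
import pandas as pd

persons = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'age':       [56, 27, 25, 37],
})

as_series = persons['age']       # plain name          -> Series
as_frame = persons[['age']]      # list with one name  -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (4, 1)

# The list also determines the column order of the result
reordered = persons[['age', 'firstname']]
print(list(reordered.columns))   # ['age', 'firstname']
```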

Note

One would wish that slicing works, just as with loc and iloc (see Pandas: Selecting Rows (And Columns) With iloc[] and Pandas: Selecting Rows (And Columns) With loc[]):

persons['firstname':'age']

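Unfortunately this does not work: inside [], a slice is interpreted as a row selection, not a column selection. Column slices require loc or iloc.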
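
Column slices do work with loc and iloc. A minimal sketch of both variants (`by_label` and `by_position` are illustrative names); note that the label-based slice includes its end, while the position-based slice excludes it:

```python
import pandas as pd

persons = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'lastname':  ['Faschingbauer', 'Faschingbauer', 'Faschingbauer', 'Lichtenberger'],
    'email':     ['jf@faschingbauer.co.at', 'johanna@email.com', 'caro@email.com', 'philipp@email.com'],
    'age':       [56, 27, 25, 37],
})

# Label-based column slice: all rows, columns 'firstname' through 'email'
by_label = persons.loc[:, 'firstname':'email']
print(list(by_label.columns))     # ['firstname', 'lastname', 'email']

# Position-based column slice: all rows, columns 0, 1 and 2
by_position = persons.iloc[:, 0:3]
print(list(by_position.columns))  # ['firstname', 'lastname', 'email']
```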

To Copy Or Not To Copy

  • Large datasets can take a long time to load

  • One does not want to make irreversible changes to them

  • ⟶ make a backup copy before experimenting

persons2 = persons.copy()
  • Or rely on inplace=False (the default wherever that parameter exists): the method then returns a modified object and leaves the original untouched
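
Both approaches can be sketched on a reduced DataFrame. The changed age value and the use of sort_values() are illustrative choices, not taken from the original material:

```python
import pandas as pd

persons = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'age':       [56, 27, 25, 37],
})

# Modify the backup copy; the original is not affected
persons2 = persons.copy()
persons2.loc[0, 'age'] = 57
print(persons.loc[0, 'age'])     # 56
print(persons2.loc[0, 'age'])    # 57

# Without inplace=True, sort_values() returns a new DataFrame
by_age = persons.sort_values('age')
print(by_age['age'].tolist())    # [25, 27, 37, 56]
print(persons['age'].tolist())   # [56, 27, 25, 37]
```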