Pandas: Basics (`DataFrame` And `Series`)#

Naive: Objects, And Collections Of Objects #

Person object, represented as naive dictionary in Python

joerg = {
    'firstname': 'Joerg',
    'lastname': 'Faschingbauer',
    'email': 'jf@faschingbauer.co.at',
    'age': 56,
}

Again, naive collection of persons: native Python list

caro = {
    'firstname': 'Caro',
    'lastname': 'Faschingbauer',
    'email': 'caro@email.com',
    'age': 25,
}
persons = [joerg, caro]
persons

[{'firstname': 'Joerg',
  'lastname': 'Faschingbauer',
  'email': 'jf@faschingbauer.co.at',
  'age': 56},
 {'firstname': 'Caro',
  'lastname': 'Faschingbauer',
  'email': 'caro@email.com',
  'age': 25}]
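With this row-oriented layout, anything column-shaped needs an explicit loop over the list. A minimal sketch, using only the two persons defined above (the names `firstnames` and `total_age` are introduced here for illustration):

```python
joerg = {
    'firstname': 'Joerg',
    'lastname': 'Faschingbauer',
    'email': 'jf@faschingbauer.co.at',
    'age': 56,
}
caro = {
    'firstname': 'Caro',
    'lastname': 'Faschingbauer',
    'email': 'caro@email.com',
    'age': 25,
}
persons = [joerg, caro]

# "Column selection": pick the same key out of every row
firstnames = [p['firstname'] for p in persons]

# Aggregation: again one pass over all rows
total_age = sum(p['age'] for p in persons)

print(firstnames)  # ['Joerg', 'Caro']
print(total_age)   # 81
```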

Inverted: Objects, And Collections Of Objects (⟶ `DataFrame`)#

A Pandas DataFrame is different
… analogous to a dictionary that contains database columns

persons = {
    'firstname': ['Joerg',                  'Johanna',           'Caro',              'Philipp'          ],
    'lastname':  ['Faschingbauer',          'Faschingbauer',     'Faschingbauer',     'Lichtenberger'    ],
    'email':     ['jf@faschingbauer.co.at', 'johanna@email.com', 'caro@email.com',    'philipp@email.com'],
    'age':       [56,                       27,                  25,                  37                 ],
}

Operation: column selection

persons['firstname']

['Joerg', 'Johanna', 'Caro', 'Philipp']

persons['age']

[56, 27, 25, 37]

Operation: aggregation

sum(persons['age'])
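The column layout can be derived mechanically from the row layout of the previous section. The following is only a sketch of that inversion in plain Python; the names `rows` and `columns` are hypothetical, and it assumes every row has the same keys:

```python
rows = [
    {'firstname': 'Joerg', 'age': 56},
    {'firstname': 'Caro', 'age': 25},
]

# Invert: one list per key instead of one dictionary per person.
# Assumes all rows share the keys of the first row.
columns = {key: [row[key] for row in rows] for key in rows[0]}

print(columns)              # {'firstname': ['Joerg', 'Caro'], 'age': [56, 25]}
print(sum(columns['age']))  # 81
```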

Enter `pandas`, `DataFrame`, `Series`#

import pandas as pd

Native Python dictionaries are not efficient enough
Native Python dictionaries are not feature-rich enough
A native Python list (used as a column) can mix data types
Pandas uses NumPy internally ⟶ values inside one column (Series) have the same type
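The effect on types can be inspected through the `dtype` attribute of a `Series`. A small sketch, with the hypothetical names `ages` and `mixed`; the exact integer width printed may depend on the platform and pandas version, so no output is claimed for it:

```python
import pandas as pd

# All values are integers: pandas stores them in one NumPy integer array
ages = pd.Series([56, 27, 25, 37])
print(ages.dtype)

# Mixed values cannot share a numeric type; pandas falls back to the
# generic "object" dtype, which holds arbitrary Python objects
mixed = pd.Series([56, 'unknown', 25])
print(mixed.dtype)  # object
```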

persons = pd.DataFrame(persons)

persons

	firstname	lastname	email	age
0	Joerg	Faschingbauer	jf@faschingbauer.co.at	56
1	Johanna	Faschingbauer	johanna@email.com	27
2	Caro	Faschingbauer	caro@email.com	25
3	Philipp	Lichtenberger	philipp@email.com	37

Note the index column

persons.shape

(4, 4)
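The index shown in the leftmost column is not one of the data columns, which is why `shape` reports four columns and not five. A short sketch that rebuilds the frame from above and looks at the index explicitly:

```python
import pandas as pd

persons = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro', 'Philipp'],
    'lastname':  ['Faschingbauer', 'Faschingbauer', 'Faschingbauer', 'Lichtenberger'],
    'email':     ['jf@faschingbauer.co.at', 'johanna@email.com', 'caro@email.com', 'philipp@email.com'],
    'age':       [56, 27, 25, 37],
})

# Without an explicit index, pandas numbers the rows 0..n-1
print(persons.index)        # RangeIndex(start=0, stop=4, step=1)

# shape is (number of rows, number of columns); the index is not counted
n_rows, n_columns = persons.shape
print(n_rows, n_columns)    # 4 4
print(len(persons))         # 4, the number of rows
```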

Selecting A Column ⟶ `Series`#

Just like a Python dictionary: index operator []

persons.columns

Index(['firstname', 'lastname', 'email', 'age'], dtype='object')

persons['firstname']

0      Joerg
1    Johanna
2       Caro
3    Philipp
Name: firstname, dtype: object

type(persons['firstname'])

pandas.core.series.Series

persons['firstname'].iloc[0]

'Joerg'
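A `Series` also carries the aggregation that was done with the builtin `sum()` on the plain list earlier. A minimal sketch using the `age` values from above (the variable name `ages` is introduced here):

```python
import pandas as pd

ages = pd.Series([56, 27, 25, 37], name='age')

# Aggregations are methods on the Series itself
print(ages.sum())    # 145
print(ages.mean())   # 36.25
print(ages.max())    # 56

# Positional access, as with iloc[0] above
print(ages.iloc[0])  # 56
```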

Selecting Multiple Columns #

Unlike Python dictionary: using index operator with a list of column names

persons[['firstname', 'age']]

	firstname	age
0	Joerg	56
1	Johanna	27
2	Caro	25
3	Philipp	37

type(persons[['firstname', 'age']])

pandas.core.frame.DataFrame

Note

One would wish that slicing works, just as with loc and iloc (see Pandas: Selecting Rows (And Columns) With iloc[] and Pandas: Selecting Rows (And Columns) With loc[]):

persons['firstname':'age']

Unfortunately this does not work: inside [], a slice is interpreted as a row selection, not a column selection.
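Label-based column slices are available through `loc`, with `:` in the row position meaning "all rows". A small sketch on a reduced version of the frame; note that, unlike Python list slices, a label slice includes its end label:

```python
import pandas as pd

persons = pd.DataFrame({
    'firstname': ['Joerg', 'Caro'],
    'lastname':  ['Faschingbauer', 'Faschingbauer'],
    'email':     ['jf@faschingbauer.co.at', 'caro@email.com'],
    'age':       [56, 25],
})

# All rows, columns from 'lastname' up to and including 'email'
subset = persons.loc[:, 'lastname':'email']
print(list(subset.columns))   # ['lastname', 'email']

# The full range 'firstname':'age' selects every column here
everything = persons.loc[:, 'firstname':'age']
print(everything.shape)       # (2, 4)
```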

To Copy Or Not To Copy #

When working on large datasets (i.e. ones that take a long time to load)
one does not want to make irreversible changes
⟶ make a backup copy before experimenting

persons2 = persons.copy()

Or use inplace=False (which is the default where that parameter exists), so that the operation returns a new object and leaves the original untouched
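Both approaches can be tried on a small frame. The sketch below uses the hypothetical names `backup` and `without_email`, and `drop()` as one example of a method that has an `inplace` parameter:

```python
import pandas as pd

persons = pd.DataFrame({
    'firstname': ['Joerg', 'Caro'],
    'email':     ['jf@faschingbauer.co.at', 'caro@email.com'],
    'age':       [56, 25],
})

# Backup copy: later changes to persons do not reach it
backup = persons.copy()
persons['age'] = persons['age'] + 1
print(list(persons['age']))   # [57, 26]
print(list(backup['age']))    # [56, 25]

# inplace=False (the default): drop() returns a new DataFrame,
# and the original keeps its 'email' column
without_email = persons.drop(columns='email')
print(list(without_email.columns))  # ['firstname', 'age']
print(list(persons.columns))        # ['firstname', 'email', 'age']
```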
