Pandas: Indexes

import pandas as pd

persons = pd.DataFrame({
    'firstname': ['Joerg',                  'Johanna',           'Caro',              'Philipp'          ],
    'lastname':  ['Faschingbauer',          'Faschingbauer',     'Faschingbauer',     'Lichtenberger'    ],
    'email':     ['jf@faschingbauer.co.at', 'johanna@email.com', 'caro@email.com',    'philipp@email.com'],
    'age':       [56,                       27,                  25,                  37                 ],
})

Default Index: Row Number

persons
   firstname       lastname                   email  age
0      Joerg  Faschingbauer  jf@faschingbauer.co.at   56
1    Johanna  Faschingbauer       johanna@email.com   27
2       Caro  Faschingbauer          caro@email.com   25
3    Philipp  Lichtenberger       philipp@email.com   37
  • See how the rows are numbered in the leftmost column

  • That column has no name

  • ⟶ default index

persons.index
RangeIndex(start=0, stop=4, step=1)
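
As a small sketch of what the default index means in practice: with a RangeIndex, row labels and row positions coincide, so loc[] and iloc[] return the same row. The two-row frame `people` below is a hypothetical, cut-down stand-in for persons.

```python
import pandas as pd

# Hypothetical cut-down version of the persons table, for illustration only
people = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna'],
    'age':       [56,      27],
})

# No index was given, so pandas numbers the rows itself
print(people.index)    # RangeIndex(start=0, stop=2, step=1)

# With the default index, label 1 and position 1 refer to the same row
by_label = people.loc[1]
by_position = people.iloc[1]
print(by_label.equals(by_position))    # True
```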

Setting Custom Index

  • Notice how email appears to be unique

  • ⟶ could be used as an index

    persons.set_index('email')
    
                            firstname       lastname  age
    email
    jf@faschingbauer.co.at      Joerg  Faschingbauer   56
    johanna@email.com         Johanna  Faschingbauer   27
    caro@email.com               Caro  Faschingbauer   25
    philipp@email.com         Philipp  Lichtenberger   37
  • This does not modify persons itself

  • Returns modified copy (could be assigned to another variable that you continue to work with, for example)

  • persons is still the same as before

    persons
    
       firstname       lastname                   email  age
    0      Joerg  Faschingbauer  jf@faschingbauer.co.at   56
    1    Johanna  Faschingbauer       johanna@email.com   27
    2       Caro  Faschingbauer          caro@email.com   25
    3    Philipp  Lichtenberger       philipp@email.com   37
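
The following sketch shows one way to act on the two observations above: check that the email column really is unique before using it as an index, and keep the copy that set_index() returns under a new name. The frame `people` and the variable names are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical cut-down version of the persons table
people = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna'],
    'email':     ['jf@faschingbauer.co.at', 'johanna@email.com'],
})

# Verify the assumption that email is unique before using it as index
email_is_unique = people['email'].is_unique
print(email_is_unique)           # True

# set_index() returns a new DataFrame; keep it under a new name
by_email = people.set_index('email')

print(list(people.columns))      # ['firstname', 'email']
print(list(by_email.columns))    # ['firstname']
```

set_index() also accepts verify_integrity=True, which raises an error if the new index contains duplicates.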

Setting Custom Index, inplace=True

  • Many (but not all) DataFrame methods support an inplace parameter

  • Default False

    • ⟶ the original object is not changed

    • Returns a modified copy of the DataFrame object

  • Handy for experimenting on a large dataset that we don’t want to damage

  • Add inplace=True once everything works

  • ⟶ the call then returns None

    persons.set_index('email', inplace=True)
    
  • The object has been modified in place

    persons
    
                            firstname       lastname  age
    email
    jf@faschingbauer.co.at      Joerg  Faschingbauer   56
    johanna@email.com         Johanna  Faschingbauer   27
    caro@email.com               Caro  Faschingbauer   25
    philipp@email.com         Philipp  Lichtenberger   37
  • Index has changed

    persons.index
    
    Index(['jf@faschingbauer.co.at', 'johanna@email.com', 'caro@email.com',
           'philipp@email.com'],
          dtype='object', name='email')
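
To make the missing return value concrete, here is a minimal sketch on a hypothetical two-row frame `people`: with inplace=True the call returns None, and the change is visible on the object itself.

```python
import pandas as pd

# Hypothetical cut-down version of the persons table
people = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna'],
    'email':     ['jf@faschingbauer.co.at', 'johanna@email.com'],
})

# With inplace=True the call returns None and modifies people itself
result = people.set_index('email', inplace=True)

print(result)               # None
print(people.index.name)    # email
```

One consequence: writing `people = people.set_index('email', inplace=True)` would leave `people` bound to None, so the two styles should not be mixed.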
    

Custom Index, And loc[]

  • loc[] selects by row label (⟶ index)

  • Row numbers are no longer row labels ⟶ they cannot be used with loc[] anymore

    persons.loc[0]
    
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    File ~/My-Environments/jfasch-home/lib64/python3.12/site-packages/pandas/core/indexes/base.py:3791, in Index.get_loc(self, key)
       3790 try:
    -> 3791     return self._engine.get_loc(casted_key)
       3792 except KeyError as err:
    
    File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()
    
    File index.pyx:181, in pandas._libs.index.IndexEngine.get_loc()
    
    File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()
    
    File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()
    
    KeyError: 0
    
    The above exception was the direct cause of the following exception:
    
    KeyError                                  Traceback (most recent call last)
    Cell In[9], line 1
    ----> 1 persons.loc[0]
    
    File ~/My-Environments/jfasch-home/lib64/python3.12/site-packages/pandas/core/indexing.py:1153, in _LocationIndexer.__getitem__(self, key)
       1150 axis = self.axis or 0
       1152 maybe_callable = com.apply_if_callable(key, self.obj)
    -> 1153 return self._getitem_axis(maybe_callable, axis=axis)
    
    File ~/My-Environments/jfasch-home/lib64/python3.12/site-packages/pandas/core/indexing.py:1393, in _LocIndexer._getitem_axis(self, key, axis)
       1391 # fall thru to straight lookup
       1392 self._validate_key(key, axis)
    -> 1393 return self._get_label(key, axis=axis)
    
    File ~/My-Environments/jfasch-home/lib64/python3.12/site-packages/pandas/core/indexing.py:1343, in _LocIndexer._get_label(self, label, axis)
       1341 def _get_label(self, label, axis: AxisInt):
       1342     # GH#5567 this will fail if the label is not present in the axis.
    -> 1343     return self.obj.xs(label, axis=axis)
    
    File ~/My-Environments/jfasch-home/lib64/python3.12/site-packages/pandas/core/generic.py:4236, in NDFrame.xs(self, key, axis, level, drop_level)
       4234             new_index = index[loc]
       4235 else:
    -> 4236     loc = index.get_loc(key)
       4238     if isinstance(loc, np.ndarray):
       4239         if loc.dtype == np.bool_:
    
    File ~/My-Environments/jfasch-home/lib64/python3.12/site-packages/pandas/core/indexes/base.py:3798, in Index.get_loc(self, key)
       3793     if isinstance(casted_key, slice) or (
       3794         isinstance(casted_key, abc.Iterable)
       3795         and any(isinstance(x, slice) for x in casted_key)
       3796     ):
       3797         raise InvalidIndexError(key)
    -> 3798     raise KeyError(key) from err
       3799 except TypeError:
       3800     # If we have a listlike key, _check_indexing_error will raise
       3801     #  InvalidIndexError. Otherwise we fall through and re-raise
       3802     #  the TypeError.
       3803     self._check_indexing_error(key)
    
    KeyError: 0
    
  • New row label: email

    persons.loc['jf@faschingbauer.co.at']
    
    firstname            Joerg
    lastname     Faschingbauer
    age                     56
    Name: jf@faschingbauer.co.at, dtype: object
    
    persons.loc[['jf@faschingbauer.co.at', 'johanna@email.com']]
    
                            firstname       lastname  age
    email
    jf@faschingbauer.co.at      Joerg  Faschingbauer   56
    johanna@email.com         Johanna  Faschingbauer   27
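
If it is not certain that a label exists, the KeyError shown above can be avoided or handled. The sketch below uses a hypothetical two-row frame `people`; it shows a membership test against the index and, alternatively, catching the exception.

```python
import pandas as pd

# Hypothetical cut-down version of the persons table, indexed by email
people = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna'],
    'email':     ['jf@faschingbauer.co.at', 'johanna@email.com'],
}).set_index('email')

# A membership test against the index avoids the KeyError
has_zero = 0 in people.index
has_joerg = 'jf@faschingbauer.co.at' in people.index

# Alternatively, attempt the lookup and catch the KeyError
try:
    row = people.loc[0]
except KeyError:
    row = None

print(has_zero, has_joerg, row)    # False True None
```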

Custom Index, And iloc[]

  • iloc[] selects by row number

  • ⟶ still valid as before

persons.iloc[0]

firstname            Joerg
lastname     Faschingbauer
age                     56
Name: jf@faschingbauer.co.at, dtype: object

persons.iloc[[0, 1]]

                        firstname       lastname  age
email
jf@faschingbauer.co.at      Joerg  Faschingbauer   56
johanna@email.com         Johanna  Faschingbauer   27
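
As a sketch of how labels and positions relate once a custom index is set: Index.get_loc() translates a label into its row number, which iloc[] then accepts. The frame `people` is a hypothetical cut-down stand-in for persons.

```python
import pandas as pd

# Hypothetical cut-down version of the persons table, indexed by email
people = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro'],
    'email':     ['jf@faschingbauer.co.at', 'johanna@email.com', 'caro@email.com'],
}).set_index('email')

# Translate a label into its row number
position = people.index.get_loc('caro@email.com')
print(position)    # 2

# The same row, once by position and once by label
same_row = people.iloc[position].equals(people.loc['caro@email.com'])
print(same_row)    # True

# iloc[] also accepts slices of row numbers (end exclusive)
first_two = people.iloc[0:2]
print(list(first_two['firstname']))    # ['Joerg', 'Johanna']
```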

Sorting DataFrame Object By Index Column

  • DataFrame.sort_index(): not in place by default ⟶ returns a sorted copy

    persons.sort_index(ascending=True)
    
                            firstname       lastname  age
    email
    caro@email.com               Caro  Faschingbauer   25
    jf@faschingbauer.co.at      Joerg  Faschingbauer   56
    johanna@email.com         Johanna  Faschingbauer   27
    philipp@email.com         Philipp  Lichtenberger   37
  • Sorting in place

    persons.sort_index(ascending=True, inplace=True)
    
    persons
    
                            firstname       lastname  age
    email
    caro@email.com               Caro  Faschingbauer   25
    jf@faschingbauer.co.at      Joerg  Faschingbauer   56
    johanna@email.com         Johanna  Faschingbauer   27
    philipp@email.com         Philipp  Lichtenberger   37
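
For completeness, a small sketch of the opposite sort order, again on a hypothetical cut-down frame `people`. Without inplace=True, sort_index() leaves the original untouched and returns the sorted copy.

```python
import pandas as pd

# Hypothetical cut-down version of the persons table, indexed by email
people = pd.DataFrame({
    'firstname': ['Joerg', 'Johanna', 'Caro'],
    'email':     ['jf@faschingbauer.co.at', 'johanna@email.com', 'caro@email.com'],
}).set_index('email')

# Descending order; returns a sorted copy, people itself is unchanged
descending = people.sort_index(ascending=False)

print(list(descending.index))
# ['johanna@email.com', 'jf@faschingbauer.co.at', 'caro@email.com']
print(list(people.index))
# ['jf@faschingbauer.co.at', 'johanna@email.com', 'caro@email.com']
```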