Whetting Your Appetite (Livehacking Demo)

Pipeline Driver

  • Read CSV

  • Read pipe members; drive empty pipeline

  • Print frame

  • DataFrame

#!/usr/bin/env python

import sys
import pandas


csvname = sys.argv[1]
pipe_stages = sys.argv[2:]

data = pandas.read_csv(
    csvname, 
    delimiter=';', encoding='iso-8859-1', 
    names=('account', 'info', 'time_booked', 'time_valuta', 'amount', 'unit'))


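for ps in pipe_stages:
    # fresh namespace per stage; stage code may use pd and np
    context = {
        'pd': pandas,
    }
    exec('import numpy as np', context)

    # run the stage file; it is expected to define transform(data)
    with open(ps) as f:
        code = f.read()
    exec(code, context)

    transform = context['transform']

    data = transform(data)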


pandas.options.display.max_colwidth = None
pandas.options.display.max_columns = None
pandas.options.display.max_rows = None
pandas.options.display.width = None

print(data)

Won’t work though …

$ ./drive-pipeline.py
Traceback (most recent call last):
  File "/home/jfasch/Homebrain/Firma/Kunden/039-IT-Visions/2023-03-13--Python-SAP--Consolut/Demo/./drive-pipeline.py", line 4, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

Virtual Environment Setup

  • Modules from Python standard library

  • Modules that are not in Python standard library (like pandas, for example)

  • Sandboxing

    • Python interpreter (and standard library) version

    • External module version

  • Jupyter notebook? Sure!
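
Whether the sandbox is actually in effect can be checked from inside Python. The following is a minimal sketch using only the standard library; the helper names in_virtualenv and module_available are made up for illustration.

```python
import importlib.util
import sys


def in_virtualenv():
    # Inside a venv, sys.prefix points to the environment, while
    # sys.base_prefix still points to the base installation.
    return sys.prefix != sys.base_prefix


def module_available(name):
    # find_spec() returns None if a top-level module cannot be found
    return importlib.util.find_spec(name) is not None


if __name__ == '__main__':
    print('interpreter:', sys.version.split()[0])
    print('virtual environment active:', in_virtualenv())
    for name in ('csv', 'pandas'):      # standard library vs. external
        print(name, 'importable:', module_available(name))
```

Run with the environment's interpreter, this should report pandas as importable only after it has been installed into that environment.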

Filter: Add Category: card-payment

  • info.startswith('Bezahlung Karte')

  • Add column category, containing value card-payment only

def transform(data):
    data['category'] = data['info'].str.startswith('Bezahlung Karte')
    return data

  • Pandas vectorized string methods

  • Series

  • Modeled after Python’s built-in str methods (only applied to a whole Series instead)

  • Insufficient: adds only bool column
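
To see what the bool column looks like, and one way to stay vectorized anyway, the mask from str.startswith() can be mapped onto category strings. This is only a sketch with made-up sample rows; the demo itself continues with apply() instead.

```python
import pandas as pd

# made-up sample rows; only the info column matters here
data = pd.DataFrame({
    'info': ['Bezahlung Karte MC/000009258|POS 2800|SPAR DANKT 5362',
             'Gutschrift Gehalt'],
})

# vectorized string method: yields one bool per row
is_card = data['info'].str.startswith('Bezahlung Karte')

# map the bool mask onto category names
data['category'] = is_card.map({True: 'card-payment', False: 'unknown'})

print(data['category'].tolist())
```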

def categorize(info):
    if info.startswith('Bezahlung Karte'):
        return 'card-payment'
    else:
        return 'unknown'

def transform(data):
    data['category'] = data['info'].apply(categorize)
    return data

  • apply()

  • ⟶ generic way to hook in custom functionality

  • ⟶ enter real Python programming

Filter: Select Uncategorized

def transform(data):
    filt_uncat = data['category'] == 'unknown'
    uncat_rows = data.loc[filt_uncat]

    return uncat_rows

  • Hiccup: duplicating the string unknown across (at least) two different filters/files
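
One way around the duplication is a single named constant that both filters use. A sketch only: the constant UNKNOWN and the function names add_category and select_uncategorized are assumptions for illustration, and everything sits in one file here just to keep the example runnable.

```python
import pandas as pd

# hypothetical: would live in a module that both filter files import
UNKNOWN = 'unknown'


def categorize(info):
    if info.startswith('Bezahlung Karte'):
        return 'card-payment'
    return UNKNOWN


def add_category(data):
    # first filter: add the category column
    data['category'] = data['info'].apply(categorize)
    return data


def select_uncategorized(data):
    # second filter: compares against the same constant, not a literal
    return data.loc[data['category'] == UNKNOWN]


data = pd.DataFrame({'info': ['Bezahlung Karte xyz', 'Gutschrift Gehalt']})
uncat = select_uncategorized(add_category(data))
print(uncat['info'].tolist())
```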

More Categories

  • card-payment is far too unspecific

  • Useless: want “Food”, “Car”, “Luxury”, …

def categorize(info):
    if info.startswith('Bezahlung Karte'):
        return categorize_card_payment(info)
    return 'unknown'

def categorize_card_payment(info):
    fields = info.split('|')
    which = fields[0]
    pos = fields[1]
    company = fields[2]

    if company.startswith('SPAR DANKT'):
        return 'living'
    if company.startswith('JET'):
        return 'car'

    return 'card-unknown'


def transform(data):
    data['category'] = data['info'].apply(categorize)
    return data

  • Heavily modified though

  • Python programming

  • Split the info field

    • Manually unpacking fields, first

    • ⟶ tuple unpacking

  • Interpret fields

  • Guess category

  • Into the wild

    • Working with crap data

    • Date formats

    • Floating point/currency formats and units (EUR?!)

    • Field tunneling (info has three fields, but not always)

  • Uncertainty!

  • Fear!!

  • ⟶ Testing!!!
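
The step from manual indexing to tuple unpacking, together with a defensive answer to field tunneling, might look like the following sketch. The name split_info and the choice to pad missing fields with empty strings are assumptions, not something the demo prescribes.

```python
def split_info(info):
    # Split into at most three fields; anything after the second '|'
    # stays in the third field.
    fields = info.split('|', 2)
    # Pad with '' so that unpacking works when fields are missing.
    fields = (fields + ['', ''])[:3]
    which, pos, company = fields        # tuple unpacking
    return which, pos, company


def categorize_card_payment(info):
    which, pos, company = split_info(info)

    if company.startswith('SPAR DANKT'):
        return 'living'
    if company.startswith('JET'):
        return 'car'
    return 'card-unknown'
```

With padding, an info value that lacks the company field ends up as card-unknown instead of raising IndexError. Whether that is the desired behavior is exactly the kind of decision a test should pin down.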

Testing

  • Modularize

    • Externalize stuff from filters/categorize_v1.py

    • filters/categorize_v2.py

      import stuff.category              # <-- use code from stuff/category.py
      
      def transform(data):
          data['category'] = data['info'].apply(stuff.category.categorize)
          return data
      
    • Import from stuff/category.py

      def categorize(info):
          if info.startswith('Bezahlung Karte'):
              return categorize_card_payment(info)
          return 'unknown'
      
      def categorize_card_payment(info):
          fields = info.split('|')
          which = fields[0]
          pos = fields[1]
          company = fields[2]
      
          if company.startswith('SPAR DANKT'):
              return 'living'
          if company.startswith('JET'):
              return 'car'
      
          return 'card-unknown'
      
      
    • See if it still works ⟶ ok

  • Add second importer: test

    • Problems

      • Primary problem: finding a category based upon the info field (a str)

      • Secondary (if at all): reading CSV

    • Unit test: tests/test_category.py

      • Test only the info column (⟶ raw strings)

      • Minor hiccup: have to set PYTHONPATH so that the stuff package is found

      from stuff.category import categorize
      
      def test_basic():
          info = r'Bezahlung Karte                              MC/000009258|POS          2800  K002 07.02. 12:34|SPAR DANKT 5362\\GRAZ\8020'
          cat = categorize(info)
          assert cat == 'living'
      
    $ pytest
    ======================================= test session starts =======================================
    platform linux -- Python 3.10.7, pytest-7.2.0, pluggy-1.0.0
    rootdir: /home/jfasch/work/jfasch-home/trainings/log/detail/2023-03-13-Python-SAP/Demo
    collected 1 item
    
    tests/test_category.py .                                                                    [100%]
    ======================================== 1 passed in 0.01s ========================================
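
Further cases follow the same pattern. The sketch below repeats categorize() from stuff/category.py so that it is self-contained (a real test file would import it instead); the shortened info strings are made up, and the last test merely pins down what currently happens when the company field is missing.

```python
# copy of stuff/category.py, repeated here only to keep the sketch runnable
def categorize(info):
    if info.startswith('Bezahlung Karte'):
        return categorize_card_payment(info)
    return 'unknown'


def categorize_card_payment(info):
    fields = info.split('|')
    company = fields[2]

    if company.startswith('SPAR DANKT'):
        return 'living'
    if company.startswith('JET'):
        return 'car'
    return 'card-unknown'


def test_car():
    assert categorize('Bezahlung Karte MC/1|POS 1|JET GRAZ') == 'car'


def test_card_unknown():
    assert categorize('Bezahlung Karte MC/1|POS 1|SOMEWHERE') == 'card-unknown'


def test_unknown():
    assert categorize('Gutschrift Gehalt') == 'unknown'


def test_missing_company_field():
    # field tunneling: fewer than three fields -> IndexError, for now
    try:
        categorize('Bezahlung Karte MC/1')
    except IndexError:
        pass
    else:
        raise AssertionError('expected IndexError')
```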