Whetting Your Appetite (Livehacking Demo)

Pipeline Driver

Read CSV
Read pipe members; drive empty pipeline
Print frame
⟶ DataFrame

Demo/drive-pipeline.py

#!/usr/bin/env python

import sys
import pandas


csvname = sys.argv[1]
pipe_stages = sys.argv[2:]

data = pandas.read_csv(
    csvname, 
    delimiter=';', encoding='iso-8859-1', 
    names=('account', 'info', 'time_booked', 'time_valuta', 'amount', 'unit'))


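for ps in pipe_stages:
    # Every stage file runs in a fresh namespace that already provides pd and np
    context = {
        'pd': pandas,
    }
    exec('import numpy as np', context)

    # Execute the stage file; it is expected to define transform(data)
    with open(ps) as f:
        code = f.read()
    exec(code, context)

    transform = context['transform']

    # Output of one stage is the input of the next
    data = transform(data)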


pandas.options.display.max_colwidth = None
pandas.options.display.max_columns = None
pandas.options.display.max_rows = None
pandas.options.display.width = None

print(data)

Won’t work though …

$ ./drive-pipeline.py
Traceback (most recent call last):
  File "/home/jfasch/Homebrain/Firma/Kunden/039-IT-Visions/2023-03-13--Python-SAP--Consolut/Demo/./drive-pipeline.py", line 4, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

Virtual Environment Setup

Modules from Python standard library
Modules that are not in Python standard library (like pandas, for example)
Sandboxing
- Python interpreter (and standard library) version
- External module version
Jupyter notebook? Sure!
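
Whether the interpreter is running inside a virtual environment, and whether a module like pandas is visible from it, can be checked from Python itself. A minimal sketch using only the standard library; the helper names in_virtualenv and module_available are made up for illustration:

```python
import importlib.util
import sys


def in_virtualenv():
    # Inside a venv, sys.prefix points into the environment directory,
    # while sys.base_prefix still points to the base installation.
    return sys.prefix != sys.base_prefix


def module_available(name):
    # find_spec() returns None if a top-level module cannot be found
    return importlib.util.find_spec(name) is not None


if __name__ == '__main__':
    print('virtual environment:', in_virtualenv())
    print('pandas importable:  ', module_available('pandas'))
```

If the second line reports False, the ModuleNotFoundError shown earlier is to be expected.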

Filter: Add Category: `card-payment`

info.startswith('Bezahlung Karte')
Add column category, containing value card-payment only

Demo/filters/cat-card-payment-v1.py

def transform(data):
    data['category'] = data['info'].str.startswith('Bezahlung Karte')
    return data

Pandas vectorized string methods
⟶ Series
Modeled after Python’s built-in str methods (only applied to a whole Series instead of a single string)
Insufficient: adds only bool column
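
What the vectorized call returns can be seen on a tiny Series. This is only a sketch: the two sample strings are made up and stand in for the info column.

```python
import pandas as pd

# Made-up sample values standing in for the 'info' column
info = pd.Series(['Bezahlung Karte MC/1', 'Gutschrift Gehalt'])

# The .str accessor applies startswith() to every element at once
mask = info.str.startswith('Bezahlung Karte')

print(mask.tolist())   # [True, False]
print(type(mask))      # a Series again, not a plain list
```

The result carries one bool per row, which is why assigning it to data['category'] yields True/False rather than a category name.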

Demo/filters/cat-card-payment-v2.py

def categorize(info):
    if info.startswith('Bezahlung Karte'):
        return 'card-payment'
    else:
        return 'unknown'

def transform(data):
    data['category'] = data['info'].apply(categorize)
    return data

apply()
⟶ generic way to hook in custom functionality
⟶ enter real Python programming
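
In isolation, apply() behaves like a loop that calls a plain Python function once per element. A small sketch with made-up values; the function shout is purely illustrative:

```python
import pandas as pd


def shout(s):
    # Any ordinary Python function: one element in, one value out
    return s.upper() + '!'


words = pd.Series(['spar', 'jet'])

# apply() calls shout() once per element and collects the results
# in a new Series
result = words.apply(shout)

# Same values as an explicit loop over the elements
assert result.tolist() == [shout(w) for w in words]
```

The price for this flexibility is speed: the function runs in Python for every row, unlike the vectorized string methods.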

Filter: Select Uncategorized

Demo/filters/uncategorized.py

def transform(data):
    filt_uncat = data['category'] == 'unknown'
    uncat_rows = data.loc[filt_uncat]

    return uncat_rows

Hiccup: duplicating the string unknown across (at least) two different filters/files
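
One possible way around the duplication is a single named constant that both filters refer to. This is a sketch only, not part of the demo; the name UNKNOWN and the helper is_uncategorized are assumptions:

```python
# Hypothetical shared definition, e.g. in a module both filters import
UNKNOWN = 'unknown'


def categorize(info):
    if info.startswith('Bezahlung Karte'):
        return 'card-payment'
    return UNKNOWN


def is_uncategorized(category):
    # The selecting filter compares against the same constant,
    # so a typo in the string cannot silently break the selection
    return category == UNKNOWN
```

In the selecting filter, the comparison would then read data['category'] == UNKNOWN.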

More Categories

card-payment is far too unspecific
Useless: want “Food”, “Car”, “Luxury”, …

Demo/filters/categorize-v1.py

def categorize(info):
    if info.startswith('Bezahlung Karte'):
        return categorize_card_payment(info)
    return 'unknown'

def categorize_card_payment(info):
    fields = info.split('|')
    which = fields[0]
    pos = fields[1]
    company = fields[2]

    if company.startswith('SPAR DANKT'):
        return 'living'
    if company.startswith('JET'):
        return 'car'

    return 'card-unknown'


def transform(data):
    data['category'] = data['info'].apply(categorize)
    return data

Heavily modified though
Python programming

Split the info field
- Manually unpacking fields, first
- ⟶ tuple unpacking
Interpret fields
Guess category
Into the wild
- Working with crap data
- Date formats
- Floating point/currency formats and units (EUR?!)
- Field tunneling (info has three fields, but not always)
Uncertainty!
Fear!!
⟶ Testing!!!
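
Tuple unpacking and the "three fields, but not always" problem can be sketched together as follows. The function name split_info and the shortened sample strings are made up; returning None for malformed input is just one possible choice:

```python
def split_info(info):
    fields = info.split('|')
    if len(fields) != 3:
        # Field tunneling gone wrong: signal it instead of
        # running into an IndexError further down
        return None
    # Tuple unpacking replaces fields[0], fields[1], fields[2]
    which, pos, company = fields
    return which, pos, company


print(split_info('Bezahlung Karte|POS 2800|SPAR DANKT 5362'))
# ('Bezahlung Karte', 'POS 2800', 'SPAR DANKT 5362')
print(split_info('Bezahlung Karte'))
# None
```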

Testing

Modularize

Externalize stuff from filters/categorize-v1.py

⟶ filters/categorize-v2.py

Demo/filters/categorize-v2.py

import stuff.category              # <-- use code from stuff/category.py

def transform(data):
    data['category'] = data['info'].apply(stuff.category.categorize)
    return data

Import from stuff/category.py

Demo/stuff/category.py

def categorize(info):
    if info.startswith('Bezahlung Karte'):
        return categorize_card_payment(info)
    return 'unknown'

def categorize_card_payment(info):
    fields = info.split('|')
    which = fields[0]
    pos = fields[1]
    company = fields[2]

    if company.startswith('SPAR DANKT'):
        return 'living'
    if company.startswith('JET'):
        return 'car'

    return 'card-unknown'

See if it still works ⟶ ok

Add second importer: test

Problems
- Primary problem: finding a category based upon the info field (a str)
- Secondary (if at all): reading CSV

Unit test: tests/test_category.py

Test only the info column (⟶ raw strings)
Minor hiccup: have to set PYTHONPATH so that the stuff package can be imported
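
As an alternative to setting PYTHONPATH by hand, the project directory can be put on the module search path from Python. The following is a sketch under assumptions: the helper ensure_importable and the idea of calling it from a Demo/conftest.py (which pytest loads before the test modules in that directory) are not part of the demo.

```python
import sys
from pathlib import Path


def ensure_importable(root):
    # Put root at the front of the module search path, but only once
    root = str(Path(root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


# In a hypothetical Demo/conftest.py this would be called as
#     ensure_importable(Path(__file__).parent)
# so that "from stuff.category import categorize" works
# without PYTHONPATH being set.
```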

Demo/tests/test_category.py

from stuff.category import categorize

def test_basic():
    info = r'Bezahlung Karte                              MC/000009258|POS          2800  K002 07.02. 12:34|SPAR DANKT 5362\\GRAZ\8020'
    cat = categorize(info)
    assert cat == 'living'

$ pytest
======================================= test session starts =======================================
platform linux -- Python 3.10.7, pytest-7.2.0, pluggy-1.0.0
rootdir: /home/jfasch/work/jfasch-home/trainings/log/detail/2023-03-13-Python-SAP/Demo
collected 1 item

tests/test_category.py .                                                                    [100%]
======================================== 1 passed in 0.01s ========================================
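
Further tests could pin down the remaining branches, including what currently happens when info carries fewer than three fields. A sketch only: the two functions are copied from stuff/category.py so that the block runs on its own (the demo would import them instead), and all info strings are made-up, shortened samples.

```python
# Copied from stuff/category.py to keep this sketch self-contained
def categorize(info):
    if info.startswith('Bezahlung Karte'):
        return categorize_card_payment(info)
    return 'unknown'

def categorize_card_payment(info):
    fields = info.split('|')
    which = fields[0]
    pos = fields[1]
    company = fields[2]

    if company.startswith('SPAR DANKT'):
        return 'living'
    if company.startswith('JET'):
        return 'car'

    return 'card-unknown'


def test_car():
    assert categorize('Bezahlung Karte MC/1|POS 1|JET 1234') == 'car'

def test_card_unknown_company():
    assert categorize('Bezahlung Karte MC/1|POS 1|ACME 1234') == 'card-unknown'

def test_no_card_payment():
    assert categorize('Gutschrift Gehalt') == 'unknown'

def test_missing_fields():
    # Documents current behavior: fewer than three fields raise IndexError
    try:
        categorize('Bezahlung Karte')
    except IndexError:
        pass
    else:
        assert False, 'expected IndexError'
```

The last test does not endorse the IndexError; it merely records the behavior so that a later, more forgiving implementation has to change the test deliberately.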
