How we used vectorization for 1000x Python speedups (no C or Spark needed!)
- Track:
- PyData: Machine Learning, Stats
- Type:
- Talk (long session)
- Level:
- intermediate
- Room:
- Forum Hall
- Start:
- 16:05 on 11 July 2024
- Duration:
- 45 minutes
Abstract
Want to make all your code faster? With matrices, library knowledge, and a sprinkle of creativity, you can consistently speed up multivariate Python functions by 1000x!
Modal optimization requires simple axioms - arithmetic, checking a case, calling the right sklearn function, and so on. When that’s not sufficient, three core tricks - converting conditional logic to set theory, stacking vectors into a matrix, and shaping data to match library expectations - cover the vast majority of real world cases (90% of the ~400 functions we vectorized).
At Bloomberg, ESG (Environmental, Social, and Governance) Scores require complex computations on large data sets. Time-series computations are fundamental for Governance - one UDF infers board support for a policy from prior cyclical votes and other time offset inputs. By rewriting the pandas backfill as a series of reductions on a 4-tensor, we reduced the runtime from 45 minutes to 10 milliseconds! Analogously, due to real world complexity, finance UDFs can end up with 100+ if/else branches in one function. With a mix of De Morgan’s laws and sparse matrix representations, we simplified the cases and achieved 1000x+ speedups.
We’ll conclude with a quick overview of cutting-edge tools, and hope you’ll leave with a concrete strategy for vectorizing financial models!