Efficient Dataset Import in Data Management Systems (Software Project)

Organizational Details

25923988 Supervisor Marcus Pinnecke, M.Sc.
LSF-entry Softwareprojekt Datenbanken und Software Engineering
Credit points 6 CP 
Team work yes (up to 3 students)
Recommended Skills and Knowledge
  • (optional) A course on database systems foundations and internals, e.g., Databases [DB I] and Database Implementation [DB II], or similar
  • (required) Knowledge of imperative programming in C, e.g., from a course on C/C++ [C++], or at least in a C-like language such as Java
  • (required) Knowledge of data structures and algorithms, e.g., Algorithms and Data Structures [AuD] or similar

Current State

For our scientific work, we are developing and researching a modern hybrid database system called MondrianDB. Its storage engine, the GridStore, is the component that you will work on during this project.

General
  • GridStore maintains a set of workloads (queries + tables)
    • active workloads are located in main-memory (MM)
    • passive workloads are located on disk
  • Transactions can manipulate tables currently stored in MM
  • Switching between workloads is possible
    • the currently active workload becomes passive
    • a selected passive workload becomes active
  • Switching is done via a Workload Switcher component
    • instantly loads tables from disk into MM
    • uses the TableImage subsystem
  • A TableImage is a binary dump of a table intended for instant load
    • generic format (i.e., not bound to a certain benchmark specification)
    • uses GridStore’s internal core API (e.g., fragments and tuplets)
    • TableImage files are generated outside the GridStore
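The switching behaviour listed above can be sketched as a small state change. This is only an illustration under stated assumptions: the names `workload_t`, `workload_state_t`, and `switch_workload` are hypothetical and not part of the GridStore API, and the sketch assumes that exactly one workload is active at a time. The actual loading of tables from TableImage files is only indicated by comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not GridStore's real data structures. */
typedef enum { WORKLOAD_PASSIVE, WORKLOAD_ACTIVE } workload_state_t;

typedef struct {
    const char      *name;   /* label of the workload (queries + tables) */
    workload_state_t state;  /* ACTIVE = in main memory, PASSIVE = on disk */
} workload_t;

/* Makes workloads[target] the active workload and turns any previously
 * active workload passive. Returns the index of the previously active
 * workload, or n if none was active. */
size_t switch_workload(workload_t *workloads, size_t n, size_t target)
{
    assert(workloads != NULL && target < n);

    size_t previous = n;
    for (size_t i = 0; i < n; i++) {
        if (workloads[i].state == WORKLOAD_ACTIVE) {
            previous = i;
            /* In a real system, its tables would leave main memory here. */
            workloads[i].state = WORKLOAD_PASSIVE;
        }
    }

    /* In a real system, the tables would be loaded from TableImage files here. */
    workloads[target].state = WORKLOAD_ACTIVE;
    return previous;
}
```

Calling `switch_workload(w, 3, 2)` on an array whose first entry is active leaves only the third entry active and returns index 0.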
Current Toolchain
  • The entire process to generate, parse, and write one benchmark suite (TPC-H) already exists and can be used as a prototype for custom implementations
  • The following figure illustrates the current tool chain:

[Figure: current tool chain]

  • A bash script is used to (a) fetch and compile the TPC-H dataset generator (gray box), and (b) generate the benchmark data at a given scale factor using this generator. The result is a set of CSV-like files (*.TBL files) that describe the dataset
  • Afterwards, the dataset is parsed and converted using a tool written by us (called tpch-convert). It converts the plain-text (implicitly row-wise) table data to a specialized, database-system-internal binary format (called TableImage) that explicitly serializes the dataset column- or row-wise and stores metadata (such as primary key constraints)
  • This process has the following disadvantages:
    • It is specialized for TPC-H benchmarks only
    • Manual effort is required to explicitly map between "text" data, SQL data types (cf. TPC-H specs), and internal data types
    • Metadata given in the TPC-H specs must be added manually
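The difference between a row-wise and a column-wise serialization, as mentioned for the conversion step above, can be illustrated with a minimal sketch. It does not reproduce the actual TableImage format: the two-attribute `row_t` and the functions `serialize_row_wise` and `serialize_column_wise` are hypothetical, and the output is a plain byte buffer in host byte order without any metadata.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-width record; NOT the GridStore tuplet layout. */
typedef struct {
    int32_t key;
    double  price;
} row_t;

/* Copies len bytes to out at offset off and returns the new offset. */
static size_t put(unsigned char *out, size_t off, const void *src, size_t len)
{
    memcpy(out + off, src, len);
    return off + len;
}

/* Row-wise layout: key0 price0 key1 price1 ... Returns bytes written. */
size_t serialize_row_wise(const row_t *rows, size_t n, unsigned char *out)
{
    assert(rows != NULL && out != NULL);
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        off = put(out, off, &rows[i].key, sizeof rows[i].key);
        off = put(out, off, &rows[i].price, sizeof rows[i].price);
    }
    return off;
}

/* Column-wise layout: key0 key1 ... price0 price1 ... Returns bytes written. */
size_t serialize_column_wise(const row_t *rows, size_t n, unsigned char *out)
{
    assert(rows != NULL && out != NULL);
    size_t off = 0;
    for (size_t i = 0; i < n; i++)
        off = put(out, off, &rows[i].key, sizeof rows[i].key);
    for (size_t i = 0; i < n; i++)
        off = put(out, off, &rows[i].price, sizeof rows[i].price);
    return off;
}
```

Both functions write the same number of bytes; only the order of the values differs. The caller must provide a buffer of at least `n * (sizeof(int32_t) + sizeof(double))` bytes.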

Project Goal and Software Project Tasks

Future State

  • The goal of this software project is to add TableImage support for the following benchmark tables (no queries)
  • Since all the benchmarks mentioned above produce a CSV or CSV-like set of tables, it is reasonable to write a more general parser. Thus, the requirement to pass this project is to
    • Implement a (generic CSV-/CSV-like) parser
    • Implement a (benchmark-specific) generation script
    • Use the existing TableImage API to write the parsed benchmark to TableImage files
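As a starting point for the generic parser, the core of a CSV-like reader is a routine that splits one record at a configurable delimiter. The following is a minimal sketch, not a prescribed design: the function name `split_record` is hypothetical, quoting and escaping are not handled, and it assumes that a trailing delimiter at the end of a record (as in the pipe-separated output of the TPC-H generator) does not introduce an additional field.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits one record in place at every occurrence of delim.
 * Pointers to the individual fields are stored in fields (at most
 * max_fields entries). Returns the number of fields found.
 * Hypothetical helper: no quoting/escaping, and a trailing delimiter
 * (or a trailing empty field) is not reported as a field. */
size_t split_record(char *line, char delim, char **fields, size_t max_fields)
{
    assert(line != NULL && fields != NULL);

    /* Cut off the line terminator, if present. */
    line[strcspn(line, "\r\n")] = '\0';

    size_t n = 0;
    char *cursor = line;
    while (*cursor != '\0' && n < max_fields) {
        fields[n++] = cursor;               /* field starts here */
        char *next = strchr(cursor, delim); /* find end of field */
        if (next == NULL)
            break;                          /* last field of the record */
        *next = '\0';                       /* terminate the field */
        cursor = next + 1;
    }
    return n;
}
```

For the input `1|foo|3.5|` with delimiter `'|'`, the function reports three fields: `1`, `foo`, and `3.5`. Passing `','` instead handles plain comma-separated input, so the delimiter stays the only benchmark-specific parameter at this level.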

Future Toolchain

[Figure: future tool chain]

 

Helpful Literature

  • Brian W. Kernighan, Dennis Ritchie. The C Programming Language. 2nd edition (February 7, 2000). Markt+Technik Verlag. ISBN 978-0131103627
  • Ben Klemens. C im 21. Jahrhundert. 2nd edition (March 28, 2014). O'Reilly Verlag GmbH & Co. KG. ISBN 978-3955616922
  • David R. Hanson. C Interfaces and Implementations: Techniques for Creating Reusable Software. 1st edition (January 2, 1997). Pearson Education. ISBN 978-0201498417
  • Heinz Peter Gumm, Manfred Sommer. Einführung in die Informatik. 10th edition (January 1, 2013). De Gruyter Oldenbourg. ISBN 978-3486706413
  • Alfons Kemper, André Eickler. Datenbanksysteme: Eine Einführung. 9th edition (September 26, 2013). De Gruyter Oldenbourg. ISBN 978-3486721393
  • Gunter Saake, Kai-Uwe Sattler, Andreas Heuer. Datenbanken: Implementierungstechniken. 3rd edition (November 10, 2011). ISBN 978-3826691560

Last modified: 27.10.2017
