FUSTAL: A parallel statistics library in futhark
Level of Education of Students Involved
Arts and Sciences
Computer Science, Data Science, Statistics
This project explores using the data-parallel functional programming language Futhark to develop the core of a general-purpose statistics library. It also seeks to show that the strict numeric-only model Futhark enforces, combined with an optimizing compiler that turns parallelizable operations into parallel code, lends itself well to implementing statistical tests. Because the library is intended to serve as a computational backend for Python and R users, statistical correctness was paramount, which required a test suite that validates the mathematical correctness of its results. Validation used the iris dataset from R, chosen because most statisticians and data scientists are familiar with it and because it provides enough data to run many different tests. The currently implemented tests are the one- and two-sample t-tests, the Pearson correlation coefficient, and the F statistic for a one-way ANOVA. A function that calculates the alpha and beta values for a simple linear regression is also implemented. Initial testing against industry-standard solutions, such as R, shows promise in terms of both performance and accuracy. When running the test suite, the library is often markedly faster than R, even when compiled to run sequentially. This is in spite of the fact that the library uses 64-bit floating-point values compared to R’s 32-bit. More testing is needed to see how dataset size affects overall performance, but results are expected to be on par with R.
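For readers who want the statistics named in the abstract spelled out, the following is a minimal reference sketch in plain Python. It is not the library's code or API: every function name here is hypothetical, the two-sample test is assumed to be the unequal-variance (Welch) form, and alpha and beta are assumed to mean the intercept and slope. Each statistic is written as sums over elementwise expressions, which is the map-and-reduce shape that a data-parallel language can parallelize.

```python
from math import sqrt

# Reference sketch only; names are illustrative, not FUSTAL's API.

def mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    # Unbiased estimator (divides by n - 1).
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def t_one_sample(xs, mu0):
    # t = (sample mean - hypothesized mean) / standard error
    return (mean(xs) - mu0) / sqrt(sample_variance(xs) / len(xs))

def t_two_sample_welch(xs, ys):
    # Assumption: unequal-variance (Welch) form of the two-sample test.
    se = sqrt(sample_variance(xs) / len(xs) + sample_variance(ys) / len(ys))
    return (mean(xs) - mean(ys)) / se

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def simple_linear_regression(xs, ys):
    # Assumption: alpha is the intercept and beta is the slope.
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = my - beta * mx
    return alpha, beta

def anova_f(groups):
    # One-way ANOVA: between-group mean square over within-group mean square.
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A validation suite of the kind the abstract describes could compare values such as these against R's output on the same columns of the iris dataset, within a floating-point tolerance.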
Hawk, Ethan, "FUSTAL: A parallel statistics library in futhark" (2023). Symposium on Undergraduate Research and Creative Expression (SOURCE). 1177.