Submitted by MLRecipes t3_10wjenb in MachineLearning
The book has grown considerably since version 1.0. It started with synthetic data as one of its main components, while also diving into explainable AI, intuitive / interpretable machine learning, and generative AI. Now at 272 pages (up from 156 in the first version), the focus is clearly on synthetic data. Of course, I still discuss explainable and generative AI: these concepts are strongly related to data synthetization.
Agent-based modeling in action
However, many new chapters have been added, covering various aspects of synthetic data: in particular, working with more diversified real datasets, how to synthetize them, and how to generate high-quality random numbers with a very fast algorithm based on digits of irrational numbers. All chapters come with visual illustrations and Python code. In addition to the newly added agent-based modeling, you will find material about:
- GAN — generative adversarial networks applied using methods other than neural networks.
- GMM — Gaussian mixture models and alternatives based on multivariate stochastic and lattice processes.
- The Hellinger distance and other metrics to measure the quality of your synthetic data, and the limitations of these metrics.
- The use of copulas, with detailed explanations of how they work, Python code, and an application to mimicking a real dataset.
- Drawbacks associated with synthetic data, in particular a tendency to replicate algorithm bias that synthetization is supposed to eliminate (and how to avoid this).
- A technique somewhat similar to ensemble methods / tree boosting but specific to data synthetization, to further enhance the value of synthetic data when blended with real data; the goal is to make predictions more robust and applicable to a wider range of observations truly different from those in your original training set.
- Synthetizing nearest neighbor and collision graphs, locally random permutations, shapes, and an introduction to AI-art.
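As a rough illustration of the Hellinger distance mentioned in the list above, here is a minimal sketch that compares one real feature against its synthetic counterpart. It is not the book's code: the function name `hellinger_distance`, the shared-bin histogram estimate, and the default of 20 bins are all assumptions made for this example.

```python
import numpy as np

def hellinger_distance(real, synth, bins=20):
    """Hellinger distance between two 1-D samples, estimated from
    histograms built on a shared set of bin edges (an assumption of
    this sketch; other density estimates are possible)."""
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)

    # Use common bin edges so both histograms are directly comparable.
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    edges = np.linspace(lo, hi, bins + 1)

    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synth, bins=edges)

    # Normalize counts to discrete probability distributions.
    p = p / p.sum()
    q = q / q.sum()

    # H(P, Q) = sqrt( 0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2 ), in [0, 1].
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(0.0, 1.0, size=5000)        # toy "real" feature
    good = rng.normal(0.0, 1.0, size=5000)        # same distribution
    bad = rng.normal(3.0, 1.0, size=5000)         # shifted distribution
    print("close match:", hellinger_distance(real, good))
    print("poor match: ", hellinger_distance(real, bad))
```

A value near 0 means the two histograms nearly coincide and a value near 1 means they barely overlap. The result depends on the number of bins, and a per-feature distance says nothing about dependencies between features.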
Newly added applications cover numerous data types and datasets, including ocean tides in Dublin (synthetic time series), temperatures in the Chicago area (geospatial data), and the insurance dataset (tabular data). I also included some material from the course that I teach on the subject.
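For readers unfamiliar with how copulas can mimic a tabular dataset, the sketch below shows one common variant, a Gaussian copula. It is only an illustration under stated assumptions, not the method or code from the book: the function name `gaussian_copula_synthesize` is hypothetical, the toy two-column data stands in for a real dataset, and at least two numeric columns are assumed.

```python
import numpy as np
from statistics import NormalDist

_std = NormalDist()                      # standard normal, from the stdlib
_ppf = np.vectorize(_std.inv_cdf)        # elementwise inverse CDF
_cdf = np.vectorize(_std.cdf)            # elementwise CDF

def gaussian_copula_synthesize(data, n_samples, seed=0):
    """Generate synthetic rows that keep each column's empirical marginal
    and the rank-based correlation structure of `data` (2-D, >= 2 columns)."""
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    rng = np.random.default_rng(seed)

    # 1. Map each column to uniform scores in (0, 1) via its ranks.
    ranks = data.argsort(axis=0).argsort(axis=0) + 1
    u = ranks / (n + 1)

    # 2. Map the uniforms to normal scores and estimate their correlation.
    z = _ppf(u)
    corr = np.corrcoef(z, rowvar=False)

    # 3. Sample correlated normals, then map back to uniforms.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u_new = _cdf(z_new)

    # 4. Push each uniform column through the empirical quantile function.
    return np.column_stack(
        [np.quantile(data[:, j], u_new[:, j]) for j in range(d)]
    )


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    x = rng.exponential(1.0, size=1000)            # toy skewed feature
    y = 2.0 * x + rng.normal(0.0, 0.5, size=1000)  # toy dependent feature
    real = np.column_stack([x, y])
    synth = gaussian_copula_synthesize(real, n_samples=1000)
    print("real corr: ", np.corrcoef(real, rowvar=False)[0, 1])
    print("synth corr:", np.corrcoef(synth, rowvar=False)[0, 1])
```

Because step 4 interpolates the empirical quantiles, synthetic values never fall outside the observed range of each column, which is one of the limitations of this simple variant.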
For the time being, the book is available only in PDF format on my e-Store here, with numerous links, backlinks, an index, a glossary, a large bibliography, and navigation features that make it easy to browse. This book is a compact yet comprehensive resource on the topic, the first of its kind. The quality of the formatting and color illustrations is unusually high. I plan on adding new books in the future: the next one will be on chaotic dynamical systems with applications. The book on synthetic data has also been accepted by a major publisher, and a print version will be available. However, it may take a while before it gets released, and the PDF version has useful features that cannot be rendered well in print or on devices such as Kindle. Once the book is published in the publisher's computer science series, the PDF version may no longer be available. You can check out the content on my GitHub repository here, where the Python code, sample chapters, and datasets also reside.
thiru_2718 t1_j7o82dn wrote
Nice work! There are some intriguing sections here that I definitely want to take a look at.
Quick question, with regard to this quote in the preface: "For instance, regression techniques ... are presented as a single method, without using advanced linear algebra."
Are you referring to Generalized Linear Models? I don't see any references to GLMs in my brief skim, but I can't think of how else regression can be presented as a single method.
Also, is there any place where we can get a preview of the "Shape Classification and Synthetization via Explainable AI" section?