Data Labeling in Machine Learning with Python: Explore modern ways to prepare labeled data for training and fine-tuning ML and generative AI models
Year of publication: 2024
Author: Suda Vijaya Kumar
Publisher: Packt Publishing
ISBN: 978-1-80461-054-1
Language: English
Format: PDF, EPUB
Quality: Publisher's layout or text (eBook)
Interactive table of contents: Yes
Number of pages: 398
Description: Take your data preparation, machine learning, and GenAI skills to the next level by learning a range of Python algorithms and tools for data labeling.
Key Features:
Generate labels for regression in scenarios with limited training data
Apply generative AI and large language models (LLMs) to explore and label text data
Leverage Python libraries for image, video, and audio data analysis and data labeling
Book Description:
Data labeling is the invisible hand that guides the power of artificial intelligence and machine learning. In today’s data-driven world, mastering data labeling is not just an advantage, it’s a necessity. Data Labeling in Machine Learning with Python empowers you to unearth value from raw data, create intelligent systems, and influence the course of technological evolution.
With this book, you’ll discover the art of employing summary statistics, weak supervision, programmatic rules, and heuristics to assign labels to unlabeled training data programmatically. As you progress, you’ll be able to enhance your datasets by mastering the intricacies of semi-supervised learning and data augmentation. Venturing further into the data landscape, you’ll immerse yourself in the annotation of image, video, and audio data, harnessing the power of Python libraries such as seaborn, matplotlib, cv2, librosa, openai, and langchain. With hands-on guidance and practical examples, you’ll gain proficiency in annotating diverse data types effectively.
By the end of this book, you’ll have the practical expertise to programmatically label diverse data types and enhance datasets, unlocking the full potential of your data.
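As a rough illustration of what "programmatic rules and heuristics" can mean in practice, the sketch below labels short text snippets with a few hand-written rules and a majority vote. It is not taken from the book: the rule functions, the keywords, and the sample reviews are all hypothetical, and it uses only the standard library rather than a weak-supervision framework such as Snorkel.

```python
from collections import Counter

# Label constants (hypothetical; chosen for this sketch only).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_contains_great(text):
    """Rule: the word 'great' suggests a positive review."""
    return POSITIVE if "great" in text.lower() else ABSTAIN


def lf_contains_refund(text):
    """Rule: mentioning a refund suggests a negative review."""
    return NEGATIVE if "refund" in text.lower() else ABSTAIN


def lf_exclamation(text):
    """Rule: a trailing exclamation mark is treated as weakly positive."""
    return POSITIVE if text.strip().endswith("!") else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_great, lf_contains_refund, lf_exclamation]


def majority_label(text):
    """Combine rule outputs by majority vote; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


reviews = [
    "Great product, works as described!",
    "I want a refund.",
    "Arrived on Tuesday.",
    "Great, now I need a refund",
]
labels = [majority_label(r) for r in reviews]
print(labels)
```

A plain majority vote treats every rule as equally reliable. Weak-supervision tools typically go further and estimate how much to trust each rule, so this should be read only as the simplest possible baseline.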
What You Will Learn:
Excel in exploratory data analysis (EDA) for tabular, text, audio, video, and image data
Understand how to use Python libraries to apply rules to label raw data
Discover data augmentation techniques for adding classification labels
Leverage K-means clustering to classify unsupervised data
Explore how hybrid supervised learning is applied to add labels for classification
Master text data classification with generative AI
Detect objects and classify images with OpenCV and YOLO
Uncover a range of techniques and resources for data annotation
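To make the K-means item above more concrete, here is a minimal pure-Python sketch of clustering unlabeled points and using the cluster index as a provisional label. It is an assumption-laden illustration, not the book's code: the function name `kmeans`, the toy points, and the choice of initial centroids are all made up for this example.

```python
import math


def kmeans(points, initial_centroids, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point gets the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Toy, unlabeled 2D data with two visually obvious groups (hypothetical).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
```

The cluster indices are arbitrary identifiers, not class names. Someone still has to inspect a few members of each cluster and decide what label, if any, the cluster corresponds to.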
Who this book is for:
This book is for machine learning engineers, data scientists, and data engineers who want to learn data labeling methods and algorithms for model training. Data enthusiasts and Python developers will be able to use this book to learn data exploration and annotation using Python libraries. Basic Python knowledge is beneficial but not necessary to get started.
Table of Contents
Preface xv
Part 1: Labeling Tabular Data
Chapter 1: Exploring Data for Machine Learning 3
Technical requirements 4
EDA and data labeling 5
Understanding the ML project life cycle 6
Defining the business problem 6
Data discovery and data collection 6
Data exploration 7
Data labeling 7
Model training 8
Model evaluation 8
Model deployment 8
Introducing Pandas DataFrames 8
Summary statistics and data aggregates 12
Summary statistics 13
Data aggregates of the feature for each target class 14
Creating visualizations using Seaborn for univariate and bivariate analysis 15
Univariate analysis 15
Bivariate analysis 21
Profiling data using the ydata-profiling library 25
Variables section 29
Interactions section 30
Correlations 31
Missing values 33
Sample data 35
Unlocking insights from data with OpenAI and LangChain 36
Summary 44
Chapter 2: Labeling Data for Classification 47
Technical requirements 47
Predicting labels with LLMs for tabular data 48
Data labeling using Snorkel 50
What is Snorkel? 51
Why is Snorkel popular? 52
Loading unlabeled data 53
Creating the labeling functions 53
Labeling rules 53
Constants 54
Labeling functions 54
Creating a label model 56
Predicting labels 57
Labeling data using the Compose library 58
Labeling data using semi-supervised learning 59
What is semi-supervised learning? 59
What is pseudo-labeling? 60
Labeling data using K-means clustering 62
What is unsupervised learning? 62
K-means clustering 63
Inertia 65
Dunn's index 65
Summary 66
Chapter 3: Labeling Data for Regression 67
Technical requirements 68
Using summary statistics to generate housing price labels 68
Finding the closest labeled observation to match the label 69
Using semi-supervised learning to label regression data 71
Pseudo-labeling 71
Using data augmentation to label regression data 74
Using k-means clustering to label regression data 78
Summary 82
Part 2: Labeling Image Data
Chapter 4: Exploring Image Data 85
Technical requirements 86
Visualizing image data using Matplotlib in Python 86
Loading the data 88
Checking the dimensions 88
Visualizing the data 88
Checking for outliers 88
Performing data preprocessing 88
Checking for class imbalance 90
Identifying patterns and relationships 91
Evaluating the impact of preprocessing 91
Practice example of visualizing data 91
Practice example for adding annotations to an image 97
Practice example of image segmentation 98
Practice example for feature extraction 100
Analyzing image size and aspect ratio 102
Impact of aspect ratios on model performance 102
Image resizing 104
Image normalization 109
Performing transformations on images – image augmentation 111
Summary 114
Chapter 5: Labeling Image Data Using Rules 115
Technical requirements 115
Labeling rules based on image visualization 116
Image labeling using rules with Snorkel 116
Weak supervision 116
Rules based on the manual visualization of an image’s object color 117
Real-world applications 118
A practical example of plant disease detection 120
Labeling images using rules based on properties 122
Bounding boxes 123
Example 1 – image classification – a bicycle with and without a person 124
Example 2 – image classification – dog and cat images 126
Labeling images using transfer learning 128
Example – digit classification using a pretrained classifier 128
Example – person image detection using the YOLO V3 pre-trained classifier 131
Example – bicycle image detection using the YOLO V3 pre-trained classifier 132
Labeling images using transformations 132
Summary 133
Chapter 6: Labeling Image Data Using Data Augmentation 135
Technical requirements 135
Training support vector machines with augmented image data 136
Kernel trick 137
Data augmentation 137
Image data augmentation 137
Implementing an SVM with data augmentation in Python 139
Introducing the CIFAR-10 dataset 139
Loading the CIFAR-10 dataset in Python 139
Preprocessing the data for SVM training 140
Implementing an SVM with the default hyperparameters 141
Evaluating SVM on the original dataset 142
Implementing an SVM with an augmented dataset 142
Training the SVM on augmented data 143
Evaluating the SVM’s performance on the augmented dataset 143
Image classification using the SVM with data augmentation on the MNIST dataset 144
Convolutional neural networks using augmented image data 146
How CNNs work 146
Practical example of a CNN using data augmentation 148
CNN using image data augmentation with the CIFAR-10 dataset 153
Summary 156
Part 3: Labeling Text, Audio, and Video Data
Chapter 7: Labeling Text Data 161
Technical requirements 161
Real-world applications of text data labeling 162
Tools and frameworks for text data labeling 164
Exploratory data analysis of text 166
Loading the data 166
Understanding the data 166
Cleaning and preprocessing the data 166
Exploring the text’s content 167
Analyzing relationships between text and other variables 167
Visualizing the results 167
Exploratory data analysis of sample text data set 167
Exploring Generative AI and OpenAI for labeling text data 171
GPT models by OpenAI 172
Zero-shot learning capabilities 172
Text classification with OpenAI models 172
Data labeling assistance 172
OpenAI API overview 172
Use case 1 – summarizing the text 173
Use case 2 – topic generation for news articles 176
Use case 3 – classification of customer queries using the user-defined categories and sub-categories 176
Use case 4 – information retrieval using entity extraction 179
Use case 5 – aspect-based sentiment analysis 181
Hands-on labeling of text data using the Snorkel API 182
Hands-on text labeling using Logistic Regression 189
Hands-on label prediction using K-means clustering 192
Generating labels for customer reviews (sentiment analysis) 194
Summary 197
Chapter 8: Exploring Video Data 199
Technical requirements 200
Loading video data using cv2 200
Extracting frames from video data for analysis 201
Extracting features from video frames 202
Color histogram 202
Optical flow features 204
Motion vectors 204
Deep learning features 205
Appearance and shape descriptors 205
Visualizing video data using Matplotlib 206
Frame visualization 207
Temporal visualization 207
Motion visualization 208
Labeling video data using k-means clustering 209
Overview of data labeling using k-means clustering 210
Example of video data labeling using k-means clustering with a color histogram 210
Advanced concepts in video data analysis 214
Motion analysis in videos 214
Object tracking in videos 216
Facial recognition in videos 218
Video compression techniques 219
Real-time video processing 220
Video data formats and quality in machine learning 222
Common issues in handling video data for ML models 223
Troubleshooting steps 223
Summary 224
Chapter 9: Labeling Video Data 225
Technical requirements 226
Capturing real-time video 226
Key components and features 227
A hands-on example to capture real-time video using a webcam 227
Building a CNN model for labeling video data 228
Using autoencoders for video data labeling 234
A hands-on example to label video data using autoencoders 236
Transfer learning 243
Using the Watershed algorithm for video data labeling 245
A hands-on example to label video data segmentation using the Watershed algorithm 245
Computational complexity 250
Performance metrics 250
Real-world examples for video data labeling 251
Advances in video data labeling and classification 252
Summary 254
Chapter 10: Exploring Audio Data 255
Technical requirements 256
Real-life applications for labeling audio data 256
Audio data fundamentals 259
Hands-on with analyzing audio data 262
Example code for loading and analyzing sample audio file 262
Best practices for audio format conversion 265
Example code for audio data cleaning 266
Extracting properties from audio data 268
Tempo 268
Chroma features 269
Mel-frequency cepstral coefficients (MFCCs) 270
Zero-crossing rate 271
Spectral contrast 272
Considerations for extracting properties 273
Visualizing audio data with matplotlib and Librosa 273
Waveform visualization 274
Loudness visualization 275
Spectrogram visualization 276
Mel spectrogram visualization 277
Considerations for visualizations 280
Ethical implications of audio data 280
Recent advances in audio data analysis 282
Troubleshooting common issues during data analysis 283
Troubleshooting common installation issues for audio libraries 285
Summary 287
Chapter 11: Labeling Audio Data 289
Technical requirements 290
Downloading FFmpeg 291
Azure Machine Learning 29
Real-time voice classification with Random Forest 291
Transcribing audio using the OpenAI Whisper model 296
Step 1 – importing the Whisper model 297
Step 2 – loading the base Whisper model 297
Step 3 – setting up FFmpeg 298
Step 4 – transcribing the YouTube audio using the Whisper model 299
Classifying a transcription using Hugging Face transformers 300
Hands-on – labeling audio data using a CNN 300
Exploring audio data augmentation 307
Introducing Azure Cognitive Services – the speech service 312
Creating an Azure Speech service 312
Speech to text 313
Speech translation 314
Summary 316
Chapter 12: Hands-On Exploring Data Labeling Tools 317
Technical requirements 318
Azure Machine Learning data labeling 318
Label Studio 318
pyOpenAnnotate 318
Data labeling using Azure Machine Learning 318
Benefits of data labeling with Azure Machine Learning 319
Data labeling steps using Azure Machine Learning 319
Image data labeling with Azure Machine Learning 320
Text data labeling with Azure Machine Learning 334
Audio data labeling using Azure Machine Learning 342
Integration of the Azure Machine Learning pipeline with the labeled dataset 345
Exploring Label Studio 346
Labeling the image data 347
Labeling the text data 351
Labeling the video data 352
pyOpenAnnotate 353
Computer Vision Annotation Tool 355
Comparison of data labeling tools 356
Advanced methods in data labeling 357
Active learning 358
Semi-automated labeling 358
Summary 359
Index 361
Other Books You May Enjoy 374