Data Wrangling. Concepts, Applications and Tools. Edition No. 1

  • Book

  • 368 Pages
  • June 2023
  • John Wiley and Sons Ltd
  • ID: 5829839

DATA WRANGLING

Written and edited by some of the world's top experts in the field, this new volume presents state-of-the-art research and the latest technological breakthroughs in data wrangling, covering its theoretical concepts, practical applications, and tools for solving everyday problems.

Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis. This process typically includes manually converting and mapping data from one raw form into another format to allow for more convenient consumption and organization of the data. Data wrangling is increasingly ubiquitous at today’s top firms.

Data cleaning focuses on removing inaccurate data from a data set, whereas data wrangling focuses on transforming the data's format, typically by converting "raw" data into another format more suitable for use. Data wrangling is a necessary component of any business. Data wrangling solutions, such as Datameer, Infogix, Paxata, Talend, Tamr, TMMData, and Trifacta, are specifically designed and architected to handle diverse, complex data at any scale.
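The distinction between cleaning and wrangling can be illustrated with a minimal sketch. It is not taken from the book: the records, the field names (`store`, `jan`, `feb`), and the helper functions `clean` and `reshape` are all hypothetical, and only the Python standard library is assumed.

```python
# Hypothetical raw records in "wide" form: one row per store, one column per month.
raw_rows = [
    {"store": "A", "jan": "120", "feb": "135"},
    {"store": "B", "jan": "n/a", "feb": "98"},   # unparseable January value
    {"store": "A", "jan": "120", "feb": "135"},  # exact duplicate of the first row
]

MONTHS = ("jan", "feb")  # assumed column names for this illustration


def clean(rows):
    """Cleaning: drop exact duplicate rows and mark unparseable values as missing."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # skip rows already seen
        seen.add(key)
        fixed = {"store": row["store"]}
        for month in MONTHS:
            value = row[month]
            # Keep only values that are plain non-negative integers.
            fixed[month] = int(value) if value.isdigit() else None
        cleaned.append(fixed)
    return cleaned


def reshape(rows):
    """Wrangling: convert wide rows into long (store, month, sales) records,
    leaving out entries whose value is missing."""
    return [
        {"store": row["store"], "month": month, "sales": row[month]}
        for row in rows
        for month in MONTHS
        if row[month] is not None
    ]


tidy = reshape(clean(raw_rows))
for record in tidy:
    print(record)
```

Here `clean` removes or flags bad values without changing the layout, while `reshape` changes the layout (wide to long) without judging the values. Real tools, including those named above, handle far more cases than this sketch does.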

This book synthesizes the processes of data wrangling into a comprehensive overview, with a strong focus on recent and rapidly evolving agile analytic processes in data-driven enterprises. Businesses and other organizations can use it to find solutions to their everyday problems and practical applications. Whether for the veteran engineer, scientist, or other industry professional, this book is a must-have for any library.

Table of Contents

1 Basic Principles of Data Wrangling 1
Akshay Singh, Surender Singh and Jyotsna Rathee

1.1 Introduction 2

1.2 Data Workflow Structure 4

1.3 Raw Data Stage 4

1.3.1 Data Input 5

1.3.2 Output Actions at Raw Data Stage 6

1.3.3 Structure 6

1.3.4 Granularity 7

1.3.5 Accuracy 7

1.3.6 Temporality 8

1.3.7 Scope 8

1.4 Refined Stage 9

1.4.1 Data Design and Preparation 9

1.4.2 Structure Issues 9

1.4.3 Granularity Issues 10

1.4.4 Accuracy Issues 10

1.4.5 Scope Issues 11

1.4.6 Output Actions at Refined Stage 11

1.5 Produced Stage 12

1.5.1 Data Optimization 13

1.5.2 Output Actions at Produced Stage 13

1.6 Steps of Data Wrangling 14

1.7 Do’s for Data Wrangling 16

1.8 Tools for Data Wrangling 16

References 17

2 Skills and Responsibilities of Data Wrangler 19
Prabhjot Kaur, Anupama Kaushik and Aditya Kapoor

2.1 Introduction 20

2.2 Role as an Administrator (Data and Database) 21

2.3 Skills Required 22

2.3.1 Technical Skills 22

2.3.1.1 Python 22

2.3.1.2 R Programming Language 25

2.3.1.3 SQL 26

2.3.1.4 MATLAB 27

2.3.1.5 Scala 27

2.3.1.6 Excel 28

2.3.1.7 Tableau 28

2.3.1.8 Power BI 29

2.3.2 Soft Skills 31

2.3.2.1 Presentation Skills 31

2.3.2.2 Storytelling 32

2.3.2.3 Business Insights 32

2.3.2.4 Writing/Publishing Skills 32

2.3.2.5 Listening 33

2.3.2.6 Stop and Think 33

2.3.2.7 Soft Issues 33

2.4 Responsibilities as Database Administrator 34

2.4.1 Software Installation and Maintenance 34

2.4.2 Data Extraction, Transformation, and Loading 34

2.4.3 Data Handling 35

2.4.4 Data Security 35

2.4.5 Data Authentication 35

2.4.6 Data Backup and Recovery 35

2.4.7 Security and Performance Monitoring 36

2.4.8 Effective Use of Human Resource 36

2.4.9 Capacity Planning 36

2.4.10 Troubleshooting 36

2.4.11 Database Tuning 36

2.5 Concerns for a DBA 37

2.6 Data Mishandling and Its Consequences 39

2.6.1 Phases of Data Breaching 40

2.6.2 Data Breach Laws 41

2.6.3 Best Practices For Enterprises 41

2.7 The Long-Term Consequences: Loss of Trust and Diminished Reputation 42

2.8 Solution to the Problem 42

2.9 Case Studies 42

2.9.1 UBER Case Study 42

2.9.1.1 Role of Analytics and Business Intelligence in Optimization 44

2.9.1.2 Mapping Applications for City Ops Teams 46

2.9.1.3 Marketplace Forecasting 47

2.9.1.4 Learnings from Data 48

2.9.2 PepsiCo Case Study 48

2.9.2.1 Searching for a Single Source of Truth 49

2.9.2.2 Finding the Right Solution for Better Data 49

2.9.2.3 Enabling Powerful Results with Self-Service Analytics 50

2.10 Conclusion 50

References 50

3 Data Wrangling Dynamics 53
Simarjit Kaur, Anju Bala and Anupam Garg

3.1 Introduction 53

3.2 Related Work 54

3.3 Challenges: Data Wrangling 55

3.4 Data Wrangling Architecture 56

3.4.1 Data Sources 57

3.4.2 Auxiliary Data 57

3.4.3 Data Extraction 58

3.4.4 Data Wrangling 58

3.4.4.1 Data Accessing 58

3.4.4.2 Data Structuring 58

3.4.4.3 Data Cleaning 58

3.4.4.4 Data Enriching 59

3.4.4.5 Data Validation 59

3.4.4.6 Data Publication 59

3.5 Data Wrangling Tools 59

3.5.1 Excel 59

3.5.2 Altair Monarch 60

3.5.3 Anzo 60

3.5.4 Tabula 61

3.5.5 Trifacta 61

3.5.6 Datameer 63

3.5.7 Paxata 63

3.5.8 Talend 65

3.6 Data Wrangling Application Areas 65

3.7 Future Directions and Conclusion 67

References 68

4 Essentials of Data Wrangling 71
Menal Dahiya, Nikita Malik and Sakshi Rana

4.1 Introduction 71

4.2 Holistic Workflow Framework for Data Projects 72

4.2.1 Raw Stage 73

4.2.2 Refined Stage 74

4.2.3 Production Stage 74

4.3 The Actions in Holistic Workflow Framework 74

4.3.1 Raw Data Stage Actions 74

4.3.1.1 Data Ingestion 75

4.3.1.2 Creating Metadata 75

4.3.2 Refined Data Stage Actions 76

4.3.3 Production Data Stage Actions 77

4.4 Transformation Tasks Involved in Data Wrangling 78

4.4.1 Structuring 78

4.4.2 Enriching 78

4.4.3 Cleansing 79

4.5 Description of Two Types of Core Profiling 79

4.5.1 Individual Values Profiling 80

4.5.1.1 Syntactic 80

4.5.1.2 Semantic 80

4.5.2 Set-Based Profiling 80

4.6 Case Study 80

4.6.1 Importing Required Libraries 81

4.6.2 Changing the Order of the Columns in the Dataset 82

4.6.3 To Display the DataFrame (Top 10 Rows) and Verify that the Columns are in Order 82

4.6.4 To Display the DataFrame (Bottom 10 rows) and Verify that the Columns Are in Order 83

4.6.5 Generate the Statistical Summary of the DataFrame for All the Columns 83

4.7 Quantitative Analysis 84

4.7.1 Maximum Number of Fires on Any Given Day 84

4.7.2 Total Number of Fires for the Entire Duration for Every State 85

4.7.3 Summary Statistics 86

4.8 Graphical Representation 86

4.8.1 Line Graph 86

4.8.2 Pie Chart 86

4.8.3 Bar Graph 87

4.9 Conclusion 89

References 90

5 Data Leakage and Data Wrangling in Machine Learning for Medical Treatment 91
P.T. Jamuna Devi and B.R. Kavitha

5.1 Introduction 91

5.2 Data Wrangling and Data Leakage 93

5.3 Data Wrangling Stages 94

5.3.1 Discovery 94

5.3.2 Structuring 95

5.3.3 Cleaning 95

5.3.4 Improving 95

5.3.5 Validating 95

5.3.6 Publishing 95

5.4 Significance of Data Wrangling 96

5.5 Data Wrangling Examples 96

5.6 Data Wrangling Tools for Python 96

5.7 Data Wrangling Tools and Methods 99

5.8 Use of Data Preprocessing 100

5.9 Use of Data Wrangling 101

5.10 Data Wrangling in Machine Learning 104

5.11 Enhancement of Express Analytics Using Data Wrangling Process 106

5.12 Conclusion 106

References 106

6 Importance of Data Wrangling in Industry 4.0 109
Rachna Jain, Geetika Dhand, Kavita Sheoran and Nisha Aggarwal

6.1 Introduction 110

6.1.1 Data Wrangling Entails 110

6.2 Steps in Data Wrangling 111

6.2.1 Obstacles Surrounding Data Wrangling 113

6.3 Data Wrangling Goals 114

6.4 Tools and Techniques of Data Wrangling 115

6.4.1 Basic Data Munging Tools 115

6.4.2 Data Wrangling in Python 115

6.4.3 Data Wrangling in R 116

6.5 Ways for Effective Data Wrangling 116

6.5.1 Ways to Enhance Data Wrangling Pace 117

6.6 Future Directions 119

References 120

7 Managing Data Structure in R 123
Mittal Desai and Chetan Dudhagara

7.1 Introduction to Data Structure 123

7.2 Homogeneous Data Structures 125

7.2.1 Vector 125

7.2.2 Factor 131

7.2.3 Matrix 132

7.2.4 Array 136

7.3 Heterogeneous Data Structures 138

7.3.1 List 139

7.3.2 Dataframe 144

References 146

8 Dimension Reduction Techniques in Distributional Semantics: An Application Specific Review 147
Pooja Kherwa, Jyoti Khurana, Rahul Budhraj, Sakshi Gill, Shreyansh Sharma and Sonia Rathee

8.1 Introduction 148

8.2 Application Based Literature Review 150

8.3 Dimensionality Reduction Techniques 158

8.3.1 Principal Component Analysis 158

8.3.2 Linear Discriminant Analysis 161

8.3.2.1 Two-Class LDA 162

8.3.2.2 Three-Class LDA 162

8.3.3 Kernel Principal Component Analysis 165

8.3.4 Locally Linear Embedding 169

8.3.5 Independent Component Analysis 171

8.3.6 Isometric Mapping (Isomap) 172

8.3.7 Self-Organising Maps 173

8.3.8 Singular Value Decomposition 174

8.3.9 Factor Analysis 175

8.3.10 Auto-Encoders 176

8.4 Experimental Analysis 178

8.4.1 Datasets Used 178

8.4.2 Techniques Used 178

8.4.3 Classifiers Used 179

8.4.4 Observations 179

8.4.5 Results Analysis Red-Wine Quality Dataset 179

8.5 Conclusion 182

References 182

9 Big Data Analytics in Real Time for Enterprise Applications to Produce Useful Intelligence 187
Prashant Vats and Siddhartha Sankar Biswas

9.1 Introduction 188

9.2 The Internet of Things and Big Data Correlation 190

9.3 Design, Structure, and Techniques for Big Data Technology 191

9.4 Aspiration for Meaningful Analyses and Big Data Visualization Tools 193

9.4.1 From Information to Guidance 194

9.4.2 The Transition from Information Management to Valuation Offerings 195

9.5 Big Data Applications in the Commercial Surroundings 196

9.5.1 IoT and Data Science Applications in the Production Industry 197

9.5.1.1 Devices that are Inter Linked 199

9.5.1.2 Data Transformation 199

9.5.2 Predictive Analysis for Corporate Enterprise Applications in the Industrial Sector 204

9.6 Big Data Insights’ Constraints 207

9.6.1 Technological Developments 207

9.6.2 Representation of Data 207

9.6.3 Data That Is Fragmented and Imprecise 208

9.6.4 Extensibility 208

9.6.5 Implementation in Real Time Scenarios 208

9.7 Conclusion 209

References 210

10 Generative Adversarial Networks: A Comprehensive Review 213
Jyoti Arora, Meena Tushir, Pooja Kherwa and Sonia Rathee

List of Abbreviations 213

10.1 Introduction 214

10.2 Background 215

10.2.1 Supervised vs Unsupervised Learning 215

10.2.2 Generative Modeling vs Discriminative Modeling 216

10.3 Anatomy of a GAN 217

10.4 Types of GANs 218

10.4.1 Conditional GAN (CGAN) 218

10.4.2 Deep Convolutional GAN (DCGAN) 220

10.4.3 Wasserstein GAN (WGAN) 221

10.4.4 Stack GAN 222

10.4.5 Least Square GAN (LSGANs) 222

10.4.6 Information Maximizing GAN (INFOGAN) 223

10.5 Shortcomings of GANs 224

10.6 Areas of Application 226

10.6.1 Image 226

10.6.2 Video 226

10.6.3 Artwork 227

10.6.4 Music 227

10.6.5 Medicine 227

10.6.6 Security 227

10.7 Conclusion 228

References 228

11 Analysis of Machine Learning Frameworks Used in Image Processing: A Review 235
Gurpreet Kaur and Kamaljit Singh Saini

11.1 Introduction 235

11.2 Types of ML Algorithms 236

11.2.1 Supervised Learning 236

11.2.2 Unsupervised Learning 237

11.2.3 Reinforcement Learning 238

11.3 Applications of Machine Learning Techniques 238

11.3.1 Personal Assistants 238

11.3.2 Predictions 238

11.3.3 Social Media 240

11.3.4 Fraud Detection 240

11.3.5 Google Translator 242

11.3.6 Product Recommendations 242

11.3.7 Video Surveillance 243

11.4 Solution to a Problem Using ML 243

11.4.1 Classification Algorithms 243

11.4.2 Anomaly Detection Algorithm 244

11.4.3 Regression Algorithm 244

11.4.4 Clustering Algorithms 245

11.4.5 Reinforcement Algorithms 245

11.5 ML in Image Processing 246

11.5.1 Frameworks and Libraries Used for ML Image Processing 246

11.6 Conclusion 248

References 248

12 Use and Application of Artificial Intelligence in Accounting and Finance: Benefits and Challenges 251
Ram Singh, Rohit Bansal and Niranjanamurthy M.

12.1 Introduction 252

12.1.1 Artificial Intelligence in Accounting and Finance Sector 252

12.2 Uses of AI in Accounting & Finance Sector 254

12.2.1 Pay and Receive Processing 254

12.2.2 Supplier Onboarding and Procurement 255

12.2.3 Audits 255

12.2.4 Monthly, Quarterly Cash Flows, and Expense Management 255

12.2.5 AI Chatbots 255

12.3 Applications of AI in Accounting and Finance Sector 256

12.3.1 AI in Personal Finance 257

12.3.2 AI in Consumer Finance 257

12.3.3 AI in Corporate Finance 257

12.4 Benefits and Advantages of AI in Accounting and Finance 258

12.4.1 Changing the Human Mindset 259

12.4.2 Machines Imitate the Human Brain 260

12.4.3 Fighting Misrepresentation 260

12.4.4 AI Machines Make Accounting Tasks Easier 260

12.4.5 Invisible Accounting 261

12.4.6 Build Trust through Better Financial Protection and Control 261

12.4.7 Active Insights Help Drive Better Decisions 261

12.4.8 Fraud Protection, Auditing, and Compliance 262

12.4.9 Machines as Financial Guardians 263

12.4.10 Intelligent Investments 264

12.4.11 Consider the “Runaway Effect” 264

12.4.12 Artificial Control and Effective Fiduciaries 264

12.4.13 Accounting Automation Avenues and Investment Management 265

12.5 Challenges of AI Application in Accounting and Finance 265

12.5.1 Data Quality and Management 267

12.5.2 Cyber and Data Privacy 267

12.5.3 Legal Risks, Liability, and Culture Transformation 267

12.5.4 Practical Challenges 268

12.5.5 Limits of Machine Learning and AI 269

12.5.6 Roles and Skills 269

12.5.7 Institutional Issues 270

12.6 Suggestions and Recommendation 271

12.7 Conclusion and Future Scope of the Study 272

References 272

13 Obstacle Avoidance Simulation and Real-Time Lane Detection for AI-Based Self-Driving Car 275
B. Eshwar, Harshaditya Sheoran, Shivansh Pathak and Meena Rao

13.1 Introduction 275

13.1.1 Environment Overview 277

13.1.1.1 Simulation Overview 277

13.1.1.2 Agent Overview 278

13.1.1.3 Brain Overview 279

13.1.2 Algorithm Used 279

13.1.2.1 Markov Decision Process (MDP) 279

13.1.2.2 Adding a Living Penalty 280

13.1.2.3 Implementing a Neural Network 280

13.2 Simulations and Results 281

13.2.1 Self-Driving Car Simulation 281

13.2.2 Real-Time Lane Detection and Obstacle Avoidance 283

13.2.3 About the Model 283

13.2.4 Preprocessing the Image/Frame 285

13.3 Conclusion 286

References 287

14 Impact of Suppliers Network on SCM of Indian Auto Industry: A Case of Maruti Suzuki India Limited 289
Ruchika Pharswan, Ashish Negi and Tridib Basak

14.1 Introduction 290

14.2 Literature Review 292

14.2.1 Prior Pandemic Automobile Industry/COVID-19 Thump on the Automobile Sector 294

14.2.2 Maruti Suzuki India Limited (MSIL) During COVID-19 and Other Players in the Automobile Industry and How MSIL Prevailed 296

14.3 Methodology 297

14.4 Findings 298

14.4.1 Worldwide Economic Impact of the Epidemic 298

14.4.2 Effect on Global Automobile Industry 298

14.4.3 Effect on Indian Automobile Industry 301

14.4.4 Automobile Industry Scenario That Can Be Expected Post COVID-19 Recovery 306

14.5 Discussion 306

14.5.1 Competitive Dimensions 306

14.5.2 MSIL Strategies 307

14.5.3 MSIL Operations and Supply Chain Management 308

14.5.4 MSIL Suppliers Network 309

14.5.5 MSIL Manufacturing 310

14.5.6 MSIL Distributors Network 311

14.5.7 MSIL Logistics Management 312

14.6 Conclusion 312

References 312

About the Editors 315

Index 317

Authors

  • M. Niranjanamurthy, M S Ramaiah Institute of Technology, India
  • Kavita Sheoran, MSIT, India
  • Geetika Dhand, Maharaja Surajmal Institute of Technology, India
  • Prabhjot Kaur