The Bad Data Handbook is finally out! It’s the first time I’ve contributed to a publication and I’m incredibly excited about it.
The top skills required of a data scientist are generally considered to be mathematical know-how, programming ability, and enough domain knowledge to ask (and then answer) relevant questions.
This book is about all of the other garbage that you have to put up with along the way. Each chapter is written by someone who has spent more time than they probably would have liked dealing with a specific issue, and each offers tips and points out pitfalls to avoid. I haven’t read the other chapters yet, but some of the included topics are:
- Test drive your data to see if it’s ready for analysis
- Work spreadsheet data into a usable form
- Handle encoding problems that lurk in text data
- Develop a successful web-scraping effort (that’s me)
- Use NLP tools to reveal the real sentiment of online reviews
- Address cloud computing issues that can impact your analysis effort
- Avoid policies that create data analysis roadblocks
- Take a systematic approach to data quality analysis
I hope at least a few of you buy it and enjoy it.