I wrote my first Python program in 1996, and my most recent a couple of weeks ago, so I can appreciate how far Python has advanced to cover a very broad range of computing tasks. I don’t program much anymore, but in my work over the years (and yours too, if you do much coding) data manipulation has always played an important role. You can’t build and apply analytical models, manage transactions, craft a Web experience, or carry out any other significant task without investing time and attention in data acquisition, cleansing, and structuring. Python is ideal for those tasks, and then for model building and data analysis. Python is great for natural language processing (NLP) in particular, a special interest of mine, and chances are it suits just about any data work that interests you.
This interview with Pythonista Katharine Jarmul focuses on data work. A couple of events provide context. Katharine is presenting a talk titled “How Machine Learning Changed Sentiment Analysis, or I Hate You Computer 😉” at this year’s Sentiment Analysis Symposium, July 12 in New York, following which she’s offering a New York class, Learn Big Data Wrangling with Python, July 13-14. We’ll get into Katharine’s background in the course of the interview. I’ll add now only that she’s co-author of O’Reilly’s Data Wrangling with Python, published earlier this year.
Seth Grimes> Make some converts: Why should folks use Python for data work, and in particular for natural language processing and sentiment analysis?
Katharine Jarmul> It’s actually pretty hard to argue against using Python for these tasks. With Google using Python (primarily) for TensorFlow models, Parsey McParseface [SyntaxNet], and word2vec, as well as hundreds of start-ups and open source tools making advancements for machine learning, sentiment analysis, and NLP in Python, I’d love to hear a good argument against it as the language du jour. I love Python because it’s easy to read, it has great math and science libraries, it’s proven to be quite scalable, and the community is unbeatable.
Seth> Your consulting work centers on market analysis, which involves data of varied types (text and numeric and perhaps geospatial and time-based) from disparate sources. Do you have any special guidance on how to clean and mash it all up in ways that make sense and produce justified insights?
Katharine> I actually just gave a talk about this at PyData Berlin (which was an amazing conference!). Data wrangling and data cleaning are the un-sexy bits of our daily work, and I wish more people were talking about them, since I think there’s a lot of work we can do to make them less painful. For me, I generally use Python and Pandas to perform some of these tasks, but there are so many tools and techniques available. In preparation for my talk, I also read a lot of the latest research and academic papers on automating the data cleaning process via machine learning. To help move the technical side along, I’ll be putting together a literature review on the topic, and hopefully we can start building some great open-source tools to help us make line-by-line data cleaning a thing of the past.
You’re not like most of the technologists I encounter: You made a career change from journalism and public policy, to data and analytics. What motivated the switch?
In my opinion, the distance between data for journalism and data for start-ups is actually quite small. When I was working at the Washington Post and USA TODAY, I was in charge of quite a few projects involving data wrangling and data munging, so those skills were shared between the two. At a start-up, however, I usually had more autonomy to make technical decisions and to grow and learn more technical things, so for me it was a natural progression of my interest in the field.
I assume your journalism and policy skills and experience have informed your approach to your current work. If that’s the case, in what ways? Or are the disciplines really different for you?
I think my background in journalism helps when it comes to communication and reporting. Many times my clients aren’t statisticians or data scientists. They want to know what the numbers mean. I had a few great professors in journalism school who helped me learn to translate my mathematical knowledge into a comprehensible article. I can now use those skills to work with my clients and make sure they understand the competitive landscape for their technology or start-up.
Your Sentiment Analysis Symposium presentation is titled “How Machine Learning Changed Sentiment Analysis, or I Hate You Computer 😉.” What species of machine learning do you see as applicable to sentiment problems, and which toolkits?
I myself am not a machine learning expert, nor do I use it often in my work. I am, of course, interested in the topic. As a Python developer, I find it very easy to write 10 lines of code that “just work” using the amazing tools available, such as TensorFlow, scikit-learn, and Theano. There are even more I haven’t had the chance to play with, so it’s a great time to be in machine learning. Regarding sentiment analysis, I recommend taking a look at spaCy (spacy.io), run and primarily written by Matthew Honnibal. They already have some interesting training sets with informal text, and have some great resources on how to get started.
Are you also applying established techniques, stuff like the lexical analysis and parsing you get with NLTK?
Most of the toolkits I’ve used have these as a part of the library, yes. 🙂
Certain emoji, including your winking emoji, most often negate rather than emphasize: You’re communicating that “I Hate You Computer” is ironic. Have you worked in emoji analytics yourself? In techniques aimed at understanding irony and sarcasm and the like? If not, is that stuff on your to-do list, or is it not important in market analyses and other work you take on?
Again, I am more of an NLP user than a library creator. For my upcoming talk I’ll be interviewing Matthew Honnibal about what they have done with sense2vec, and it’s pretty amazing. Emoji are just Unicode, and for that reason, they can be parsed just like anything else. In a PyData user group talk in Berlin, spaCy was demonstrated to know that a smiley face emoji is similar to other positive emoji faces. At the end of the day, text is parseable and emoji are just special code points, so I don’t see why we aren’t in an age where this is a (nearly) solved problem.
I’ve learned to be wary when a coder uses the word “just.” And I’ll add that one of the most interesting talks at last year’s sentiment symposium was “Emojineering at Instagram,” by Instagram software engineer Thomas Dimson, covering semantic analysis of emoji use.
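Katharine’s point that emoji are ordinary Unicode code points can be illustrated with a minimal sketch that uses only Python’s standard library. The function name `describe_symbols` and the sample sentence are hypothetical, chosen for illustration. The choice to filter on the Unicode category "So" (Symbol, other) is also an assumption made here: it catches most single-character pictographic emoji, but it is not a complete emoji detector.

```python
import unicodedata


def describe_symbols(text):
    """Return (character, code point, Unicode name) for each "So" character in text."""
    found = []
    for ch in text:
        # Category "So" (Symbol, other) covers most pictographic emoji,
        # along with some non-emoji symbols.
        if unicodedata.category(ch) == "So":
            code_point = f"U+{ord(ch):04X}"
            name = unicodedata.name(ch, "UNKNOWN")
            found.append((ch, code_point, name))
    return found


# Hypothetical sample text ending in the winking-face emoji (U+1F609).
sample = "I hate you, computer \U0001F609"
for ch, code_point, name in describe_symbols(sample):
    print(ch, code_point, name)
```

Finding the code point is the easy part. Emoji built from several code points (skin-tone modifiers, joined sequences) are not handled by this sketch, and recognizing that a wink signals irony is a modeling question that parsing alone does not answer.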
What (else) is on your to-do list, to learn and to apply, when it comes to Python, data wrangling, machine learning, and NLP and sentiment analysis?
I’ll be focusing on the intersection of data cleaning and machine learning. It’s of interest to me, and I’m already chatting with some folks about what happens next in terms of open-source libraries to use and possibly build. If you are also interested in these problems, feel free to reach out! I’m @kjam on Twitter and freenode, or reachable by email at katharine (at) kjamistan.com.
Very cool. Thanks, Katharine.
This interview seems to have elicited from Katharine a catalog of Python machine learning and NLP modules, essential for data wrangling and analysis. I’m looking forward to meeting her in New York!