Tidy Data

Of course we have technology in our galleries and classrooms and information on the Web; of course we are exploiting social media to reach and grow our audiences. . . . But we aren’t conducting art historical research differently. We aren’t working collaboratively and experimentally.

James Cuno, Daily Dot Article

I start with this quote from Modernizing Art History because of its relevance to clean data. Many of the things Mr. Cuno mentions here are technologies that don’t necessarily require massive amounts of manpower spent cleaning data. Technology in galleries and classrooms often takes the form of iPads, touch screens, or laptops: technology that can often be used out of the box with little effort. Putting information on the web can take time, since the research has to be gathered first, but the process of putting the information up is ultimately not far from writing an article or any other piece of writing. Social media is much the same: while effort generally goes into putting together a content calendar, the tools themselves are relatively straightforward. And then there’s cleaning data for research, which is a different beast entirely.

While visualizing data can yield fascinating results, the work it takes to get to the point of actually analyzing the data can be incredibly time-consuming. Cleaning data may seem straightforward, but people often start cleaning only to realize they want to do something another way. For example, maybe at first you decide to enter dates one way, but then midway through you change your mind. You then have to go back and make sure all of the data follows the same convention. This issue also points to why someone may shy away from collaborative work. People often have different ideas of how best to clean data, and unless the conventions are set up from the very beginning, you risk having to do the work two or three times to make sure everything is consistent.
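To make the date example concrete, here is a minimal sketch in Python of bringing dates entered under several conventions back to a single one. The list of formats and the sample values are assumptions made up for illustration, not taken from any real dataset, and the month names are assumed to be in English.

```python
from datetime import datetime

# Hypothetical formats that might have crept into a dataset over time.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y"]


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next convention
    return None  # leave it for a human to look at


# Made-up sample values showing the same day entered four different ways.
messy = ["1987-03-14", "03/14/1987", "14 March 1987", "March 14, 1987", "spring 1987"]
cleaned = [normalize_date(d) for d in messy]
```

Anything the function cannot parse comes back as None rather than being guessed at, so vague entries like “spring 1987” can be reviewed by hand.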

I ran into this while working on the Dictionary of Art Historians. Having started as a card catalogue in the 80s, the dictionary has gone through many iterations and has had several collaborators. As a result, the data inside it takes several different forms. For example, as we saw in class, someone’s birthplace may be entered as just a country, as a city and country, or as a city and country in the native language. At one point, I went through the data and cleaned up the field for home country, so that one could use a faceted search to filter for only the Italian art historians or only the Germans. The problem with this field was that everyone had entered the country differently: Italy vs. Italia vs. Kingdom of Napoli. Typos creep in as well, so that France comes out as “frace”. This is one reason to advocate for controlled vocabularies integrated into the database, so that someone inputting data picks from a drop-down list rather than typing the value themselves.
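A rough sketch of what that kind of cleanup could look like in Python follows. It is not how the Dictionary of Art Historians actually handles the field; the vocabulary, the variant table, and the similarity cutoff are all assumptions chosen for illustration. Known variants are looked up directly, and typos are matched against the controlled list with the standard library’s difflib.

```python
from difflib import get_close_matches

# Hypothetical controlled vocabulary: the only values the field may hold.
CONTROLLED = ["Italy", "France", "Germany"]

# Hand-built table of known variants (illustrative editorial choices).
VARIANTS = {
    "italia": "Italy",
    "kingdom of napoli": "Italy",
}


def normalize_country(raw):
    """Map a free-text country entry to a controlled term, or None."""
    key = raw.strip().lower()
    if key in VARIANTS:
        return VARIANTS[key]
    lowered = {term.lower(): term for term in CONTROLLED}
    if key in lowered:
        return lowered[key]
    # Catch small typos; the 0.8 cutoff is an arbitrary assumption.
    close = get_close_matches(key, list(lowered), n=1, cutoff=0.8)
    return lowered[close[0]] if close else None


entries = ["Italy", "Italia", "Kingdom of Napoli", "frace", "Atlantis"]
cleaned = [normalize_country(e) for e in entries]
```

Entries that match nothing come back as None instead of being forced into the nearest country, which keeps the decision with a person. A drop-down list at input time would make most of this unnecessary.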

One kind of data I’m thinking about working with is the text of the Diana and Actaeon myth, comparing its different versions using a tool like Voyant or venturing into Python’s NLTK. To do this, the texts would first need to be put into plain text files. Then it would be important to make sure each file contains just the story I want, as many of the versions may have other stories before or after it. While Voyant will take stopwords out for you, with a tool like NLTK the text has to go through several steps before you can get reliable data. These include taking out stopwords, truncating, and tokenizing, techniques that are also used by search engines like Google. As we learned in class, stopwords are words that occur frequently but carry little meaning. These lists can be very short or very long depending on the text, and of course it’s important to remember that each language has its own stopword list. Truncating refers to grouping together words that share the same beginning. You may notice this if you search Google for something like “computer science” and get results involving the word “computing”. The final technique is tokenizing, which involves breaking a sentence down into individual words that the computer is then able to read.
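The three steps can be sketched in plain Python, without NLTK, just to show what each one does. This is a simplified illustration under stated assumptions: the stopword list is tiny and hand-picked, the truncation is a crude cut to the first few letters, and the sample sentence is made up rather than quoted from any version of the myth. NLTK ships its own tokenizers, stopword lists, and stemmers, which would be the better choice for real work.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer
# and specific to the language of the text.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "was", "her", "his"}


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def truncate(token, length=4):
    """Crude truncation: keep only the first `length` letters."""
    return token[:length]


# Made-up sample sentence, not a quotation.
sentence = "Actaeon was hunting in the forest and the hunters followed."
tokens = tokenize(sentence)
content = remove_stopwords(tokens)
stems = [truncate(t) for t in content]
```

After truncation, “hunting” and “hunters” both become “hunt” and would be counted together, which is the point of the technique. The same cut also turns “forest” into “fore”, which shows why a fixed-length cut is only a rough stand-in for a proper stemmer.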

As the Prown article we read shows, art historians are interested in the types of research questions that can be generated by “big data”, but there is still resistance even today to using data in a formalized manner. Maybe it’s a silly thought, but I do wonder if part of the reason art historians may be reluctant to use data is the prep involved. Many of us work in a manner where we look at the object first and then do research later. You don’t have to do a lot of work to prep your eyes to look at the object; in a sense, visual analysis gives a sort of instant gratification. While art historians are clearly detail-oriented, data input and cleaning take a sort of monotonous attention to detail that we’re not used to. That said, I think one can get better at data cleaning with practice and can become comfortable with inputting data in a consistent manner.

Maybe one day our computers will become advanced enough to read messy data, but until then we have to deal with keeping our data tidy.
