Data cleaning includes filling missing values in data sets. Photo by: geralt (via Pixabay)
Data science involves creating models that can predict the future, such as what the yields will be for the next planting season. This work, arguably, is “sexy.”
There’s an aspect of data science, though, that is “unsexy” but a requisite to actually developing those predictive models.
It’s called data cleaning or data curation.
Depending on the quality of a data set, data cleaning takes up between 60 percent and 80 percent of a data scientist’s or a data analytics team’s time.
Take for instance the data sets that the CIAT data team had to deal with for the recently concluded 2018 Syngenta Crop Challenge in Analytics. The team took home the competition’s top prize, besting some of the top-notch data scientists and teams from around the globe.
The contest’s organizers asked competitors to come up with models predicting the yield of hybrid maize using genetic and climate variability data sets that featured thousands of variables. According to Hugo Andres Dorado Betancourt, a member of the winning team, the data sets were large compared to what he and his teammates would deal with on a regular basis.
“We’re used to limited data, not those big volumes of data,” he said.
Dorado described the data sets as being of “good quality,” noting that Syngenta had done a lot of “preprocessing,” which included using consistent text casing and terms to label entries.
Yet, it still took the team two out of the three months allotted to work on its entry and submit it to the competition.
The team used those two months to fill missing data and determine which of the variables to keep in order to develop the model. To do this, they sought the help of experts within CIAT.
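The article does not say which techniques the team used, so the following is only a minimal sketch of one common way to approach the two tasks described above: drop variables that are mostly empty, then fill the remaining gaps with each variable's median. The records, column names, and the 50 percent threshold are all hypothetical.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value.
rows = [
    {"rainfall_mm": 1200.0, "temp_c": 24.0, "soil_ph": None},
    {"rainfall_mm": None, "temp_c": 26.0, "soil_ph": None},
    {"rainfall_mm": 900.0, "temp_c": None, "soil_ph": 6.5},
    {"rainfall_mm": 1100.0, "temp_c": 25.0, "soil_ph": None},
]

def clean(rows, max_missing_share=0.5):
    """Drop variables that are mostly missing, then median-fill the rest."""
    fill_values = {}
    for col in rows[0]:
        present = [r[col] for r in rows if r[col] is not None]
        missing_share = 1 - len(present) / len(rows)
        # Keep a variable only if enough of its values were observed.
        if missing_share <= max_missing_share:
            fill_values[col] = median(present)
    return [
        {col: (r[col] if r[col] is not None else fill)
         for col, fill in fill_values.items()}
        for r in rows
    ]

cleaned = clean(rows)
```

In practice, deciding which variables to keep is rarely a purely mechanical step, which is why the team also consulted domain experts.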
Ordinarily, however, the CIAT data team deals with agricultural data sets that have not been preprocessed the way Syngenta’s were, so cleaning them takes even longer.
Daniel Jimenez, who leads the team, said cleaning agricultural data takes so long because common terms used in the field are not standardized.
Some, he said, would call “rainfall” what others would call “precipitation.” Or it would be “farmers” to some, and “producers” or “growers” to others. Some data set entries would have hyphens (-), while others would include the underscore symbol (_).
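A minimal sketch of how such inconsistencies might be harmonized in code, assuming a small hand-built synonym table. The table, the function name, and the choice of preferred terms are hypothetical and do not represent any CGIAR tool.

```python
import re

# Hypothetical synonym table: variant term -> preferred term.
SYNONYMS = {
    "rainfall": "precipitation",
    "producers": "farmers",
    "growers": "farmers",
}

def normalize_label(label):
    """Lower-case a label, unify separators, and map synonyms to one term."""
    # Treat hyphens, underscores, and whitespace runs as the same separator.
    words = re.split(r"[-_\s]+", label.strip().lower())
    return "_".join(SYNONYMS.get(w, w) for w in words if w)

labels = ["Annual-Rainfall", "annual_precipitation", "Growers", "producers"]
normalized = [normalize_label(label) for label in labels]
```

Here the first two labels collapse to the same name, as do the last two. A real synonym table would need to come from an agreed vocabulary rather than from a single analyst's guesses.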
If terms in databases are not standardized, running a query (say, the number of farmers in Nicaragua under 45 years of age who have some level of education and own a farm in a region with precipitation above 1,000 mm) would be impossible.
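To illustrate, here is a sketch of that query over records whose fields have already been standardized. The records and field names are invented for the example. The point is that the filter only works when every record uses the same names and units.

```python
# Hypothetical, already-standardized records; field names are illustrative.
farmers = [
    {"country": "Nicaragua", "age": 38, "education": "primary",
     "owns_farm": True, "precipitation_mm": 1350},
    {"country": "Nicaragua", "age": 52, "education": "secondary",
     "owns_farm": True, "precipitation_mm": 1400},
    {"country": "Nicaragua", "age": 29, "education": "none",
     "owns_farm": True, "precipitation_mm": 1200},
    {"country": "Honduras", "age": 41, "education": "primary",
     "owns_farm": True, "precipitation_mm": 1500},
    {"country": "Nicaragua", "age": 44, "education": "secondary",
     "owns_farm": True, "precipitation_mm": 950},
    {"country": "Nicaragua", "age": 31, "education": "university",
     "owns_farm": True, "precipitation_mm": 1100},
]

def matches(f):
    """True if a record satisfies every condition of the example query."""
    return (
        f["country"] == "Nicaragua"
        and f["age"] < 45
        and f["education"] != "none"
        and f["owns_farm"]
        and f["precipitation_mm"] > 1000
    )

count = sum(1 for f in farmers if matches(f))
```

If half the records stored the same measurement under "rainfall" instead of "precipitation_mm," the filter would fail or silently skip them unless the data were cleaned first.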
“When you talk about FAIR [findable, accessible, interoperable, reusable] data, we should be able to run those queries and run them automatically. But we cannot do that because we have to, at the moment, manually clean the data,” Jimenez said. “It’s definitely a frustrating process.”
Within the CGIAR Platform for Big Data in Agriculture, there’s a group of experts working to standardize agricultural terms. Both the Ontologies Data community of practice and the Organize module have developed tools for data harmonization.
Beyond the CGIAR system, there’s still much work to be done.
“We have decades of agricultural data that used to be locked up that are now made available. Most of them are unstructured, so it’s kind of a mess,” Jimenez said. “That’s why we need to be more serious about standardization, to make data interoperable.”