Big Data and its need for unicorns

There is no doubt that big data* has become a valuable commodity of this century. A recent article in The Economist, “Fuel of the future: Data is giving rise to a new economy,” states that data has become to this century what oil was to the previous one. Given its relevance, most sectors have begun to establish big data initiatives. For instance, the Harvard Business Review collected information from executives at Fortune 1000 companies and government agencies about their use of big data, and 85% of organizations reported that they had started or planned to start using big data.

By Robert Andrade | Jun 01, 2017

Private and public entities continue to develop innovative initiatives, and the agricultural R&D sector is no exception. One example is the recently launched Platform for Big Data in Agriculture, led by CIAT-IFPRI, which brings together CGIAR centers and public and private research organizations to generate actionable, data-driven insights for stakeholders. Another is the International Agroinformatics Alliance (IAA), led by the UMN, which is collecting extensive multi-year, geocoded field trial data from diverse institutions (e.g. CIMMYT, CIAT, Embrapa) to provide innovative data solutions for breeding by design, among other possible outcomes.

 

With these new initiatives in big data come diverse challenges. The amount of data flowing from different sources requires a broad spectrum of skills to handle big data characteristics (i.e. volume, variety, velocity, veracity, and valence) and to generate value from it. Some have suggested that finding a data scientist (i.e. an expert with skills in mathematics, statistics, computer science, communication, and business) is like finding a unicorn! Nevertheless, there is hope that bringing together scientists whose different backgrounds cover these characteristics will work. An “all for one and one for all” approach, as in Alexandre Dumas’s historical novel The Three Musketeers, can address the challenges of big data.

Because this diverse set of skills is needed to deal with the challenges of big data, the impact assessment team evaluated its own knowledge and ability in four areas: math and statistics, programming and databases, domain knowledge and soft skills, and communication and visualization. Twelve of our 15 researchers considered themselves to be data scientists, and most of them agreed on the need to manage and use data to generate better policy suggestions.

Figure 1 shows the team’s standardized level of knowledge. As expected, since most of us are economists, our strengths lie along the math and statistics axis and less in programming and database skills (the solid black line represents the average level of expertise). Nevertheless, some scientists are quite strong in that area, as shown by the upper dotted boundary, which represents the maximum level of knowledge among researchers and could spill over to others.

Figure 1. Scientist knowledge of big data characteristics.

Figure 2 shows an evaluation of scientists’ knowledge of each characteristic within the main categories presented above, for example the level of knowledge about machine learning (from 0, no knowledge, to 3, an advanced level). At this level, we are weak in areas such as MapReduce concepts or Hadoop use, while our main strength is in statistical modeling or programming. Even though the impact assessment team continuously cleans and prepares databases for analysis, concepts such as relational algebra or parallel databases can be elusive to us. In our team, we are not familiar with the jargon used by big data scientists. For instance, relational algebra, which refers to the use of queries and operators to generate output variables, is an unfamiliar term to most economists. The difference between the jargon that economists and big data scientists use is a challenge, but we are good at database management and at analysis using statistical methods: in the last half decade, the unit has collected, processed, and analyzed around 17,000 household surveys.

Figure 2. Scientist knowledge of specific characteristics of big data.
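
To make the jargon a little less elusive, here is a minimal sketch of the three basic relational algebra operators (selection, projection, and join) written in plain Python. The survey records, column names, and helper functions are all hypothetical and invented for illustration; they are not the team’s actual data or tools.

```python
# Hypothetical household survey records; all names and values are made up.
households = [
    {"hh_id": 1, "region": "North", "size": 4},
    {"hh_id": 2, "region": "South", "size": 6},
    {"hh_id": 3, "region": "North", "size": 3},
]
plots = [
    {"hh_id": 1, "crop": "beans", "area_ha": 1.5},
    {"hh_id": 2, "crop": "cassava", "area_ha": 0.8},
    {"hh_id": 3, "crop": "beans", "area_ha": 2.0},
]


def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Projection: keep only the named columns.

    Unlike strict set-based relational algebra, duplicates are not removed.
    """
    return [{c: row[c] for c in columns} for row in rows]


def join(left, right, key):
    """Join: pair rows from both tables that share the same key value."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# "Which crops do households in the North grow, and on how much land?"
north = select(households, lambda row: row["region"] == "North")
result = project(join(north, plots, "hh_id"), ["hh_id", "crop", "area_ha"])
print(result)
```

Economists do the same thing routinely when they filter a survey, keep a few variables, and merge two data files; relational algebra is simply the formal name for composing those steps.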

Dealing with big data requires more than one particular skill; it requires multidisciplinary work. We need to learn the jargon developed around the big data universe to exploit our strengths and support innovative initiatives. After all, we have surmounted the Tower of Babel before: we have been able to communicate with each other in the same language and to work successfully in multidisciplinary teams.

* Big data is a term for datasets that are so large or complex that traditional data processing application software cannot deal with them. They pose challenges in terms of data capture, storage, analysis, curation, search, sharing, transfer, visualization, querying, updating, and information privacy. Source: Wikipedia.