Big Data Platform series: Setting up the BigDataGrapes platform

It all started back in 2017, when our consortium had an ambitious dream: to create a data-powered solution that collects, processes, translates, and enriches food and agriculture data, supporting the communities in and around the grapevine-powered industries. That year, BigDataGrapes transformed from a dream into an EU-funded Horizon 2020 project, and we are already in the third - and final - year of the project.

So let’s start with introductions: “The BigDataGrapes Big Data Platform”

In a nutshell, the BigDataGrapes Big Data Platform is a back-end system responsible for collecting, processing, indexing, and publishing heterogeneous food and agriculture data from a large variety of data sources. The platform is organized as a microservice architecture, with different technology components handling different aspects of the data lifecycle. All of the components are interconnected through well-defined connectors and API endpoints, with each component responsible for storing and processing different types of data. More specifically, the platform includes:

  • the Data APIs component, which is the machine-readable interface to the different types of data collected in the platform. This part of the architecture is responsible for making data discoverable, but also for submitting new data assets back to the platform;
  • the Data Integration component, through which data is submitted to the platform via a workflow of four steps: data collection, data filtering and transformation, data enrichment, and data curation;
  • the Data Indexing component, which transforms data into an appropriate format designed for performance optimization;
  • the Storage component, which features various storage engine technologies, responsible for the physical archiving of data collections;
  • the Knowledge Classification component, which provides rules and standards for the organization of data records stored and processed by the platform;
  • the Data Processing component, which is responsible for hosting individual text mining, machine learning and data correlation scripts that can be used in a variety of contexts as standalone pieces of code or as web services through the so-called Intelligence APIs.
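To make the role of the Data APIs component concrete, here is a minimal in-memory sketch of a machine-readable data interface with the two responsibilities described above: making data discoverable and accepting new data assets. The class and method names (`DataAPI`, `discover`, `submit`) and the example fields are hypothetical illustrations, not the platform's actual interface.

```python
# Toy stand-in for the Data APIs component (hypothetical names and fields).
class DataAPI:
    """Minimal model of a machine-readable data interface."""

    def __init__(self):
        self._assets = []  # in place of the real storage back end

    def submit(self, asset: dict) -> None:
        """Submit a new data asset back to the platform."""
        self._assets.append(asset)

    def discover(self, **filters) -> list:
        """Make data discoverable: return assets matching all given filters."""
        return [
            a for a in self._assets
            if all(a.get(k) == v for k, v in filters.items())
        ]


api = DataAPI()
api.submit({"type": "ndvi", "region": "Bordeaux", "year": 2019})
api.submit({"type": "sensor", "region": "Bordeaux", "year": 2019})
print(api.discover(type="ndvi"))  # returns only the NDVI asset
```

In the real platform these operations sit behind REST endpoints rather than a Python class, but the contract is the same: a discovery call filtered by metadata, and a submission call that feeds new assets into the Data Integration workflow.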


The Platform Challenges: Storage

Data collection is a major part of the BigDataGrapes platform and will be further analyzed in a future post. From Earth observation data to NDVI indexes and environmental measurements, the platform is set to handle diverse datasets. The critical part at this point is storage! We have a wide variety of data, each type with its own properties. Data concerning food recalls arrives at a velocity of under a hundred records per day, while data coming from sensors deployed at field level arrives at a velocity of thousands of records per day; no single framework would be able to meet our needs. So we decided to split the data based on its velocity: data with lower velocity (e.g. raw HTML, XLS) is stored in a MongoDB instance, and data with higher velocity in an Apache Cassandra cluster.

But data is useless if you cannot process it, so the next part of our stack is the Transformer, which maps collected records into our internal schema based on the entity type being processed. To that end, our stack employs Python and PHP scripts as well as a custom Java project, all of which take care of harmonizing the collected data.
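The velocity-based split and the harmonization step can be sketched as follows. This is a simplified illustration: the threshold value, the function names, and the internal-schema shape are assumptions for the example, and in practice whole collections, not single records, are routed to a store.

```python
# Hypothetical sketch of velocity-based storage routing and harmonization.
VELOCITY_THRESHOLD = 100  # records per day; assumed cut-off for this example

def choose_store(records_per_day: int) -> str:
    """Low-velocity data (e.g. raw HTML, XLS) goes to MongoDB;
    high-velocity sensor streams go to Apache Cassandra."""
    return "mongodb" if records_per_day < VELOCITY_THRESHOLD else "cassandra"

def harmonize(record: dict, entity_type: str) -> dict:
    """Toy Transformer step: wrap a raw record in a (hypothetical)
    internal schema keyed by the entity type being processed."""
    return {"entity": entity_type, "payload": record}

print(choose_store(30))    # food recalls, a few per day -> 'mongodb'
print(choose_store(5000))  # field-level sensor readings -> 'cassandra'
print(harmonize({"temperature_c": 21.5}, "sensor_reading"))
```

The same routing decision could equally be made per data source at ingestion time; the point is that the write path is chosen once, by velocity, before the Transformer normalizes each record.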



The Platform Challenges: Data enrichment

Time for the cooler parts of our stack to take action. We need to identify important terms in the collected data. This is where data mining, NLP, named-entity recognition (NER), machine learning (ML), and deep learning (DL) techniques are employed. A number of projects, with their respective API endpoints, are deployed to take care of these tasks. As far as technologies and frameworks are concerned, we have Spring {Boot, Data} projects, Flask endpoints taking advantage of scikit-learn, and Keras classifiers, all communicating with Elasticsearch instances and internally trained models. Each produces a confidence score; if the score is above a threshold, the prediction is accepted as a valid response.
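The final acceptance step above can be sketched in a few lines. The function name, the 0.8 threshold, and the example label are illustrative assumptions, not the platform's actual values.

```python
# Minimal sketch of threshold-based acceptance of classifier output.
ACCEPT_THRESHOLD = 0.8  # assumed confidence cut-off for illustration

def accept_prediction(label: str, score: float,
                      threshold: float = ACCEPT_THRESHOLD):
    """Return the label as a valid response only if the classifier's
    confidence score clears the threshold; otherwise reject it (None)."""
    return label if score >= threshold else None

# Hypothetical example label from an enrichment classifier:
print(accept_prediction("grape_variety", 0.93))  # accepted
print(accept_prediction("grape_variety", 0.41))  # rejected -> None
```

Rejected predictions need not be discarded outright; a common design is to route them to the data curation step for human review instead.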



How does the Platform serve data scientists?

With all the data stored in the platform's Storage component, it is time to see an example of how the different components work together.

One of the user groups that the BigDataGrapes Platform targets is data scientists working in research, academia, or industry. The main need of data scientists is fast dataset selection and algorithm execution, so that they can get a swift view of the expected results.

How is this made possible through the Big Data Platform of BigDataGrapes? First, data scientists execute simple and complex queries to locate the dataset that covers their needs. Apart from the available datasets, one can also create a synthetic dataset in order to increase the volume of the data and test the algorithms. This is made possible by the platform components developed by ONTOTEXT, the semantic technology branch of Sirma Group.
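As a generic illustration of the synthetic-dataset idea (not ONTOTEXT's actual implementation), one common technique is to generate noisy copies of existing records to increase volume. The function name, jitter size, and example fields here are assumptions:

```python
import random

def synthesize(rows, factor=3, jitter=0.05, seed=42):
    """Create `factor` perturbed copies of each row, adding small
    relative noise to numeric fields and copying the rest verbatim."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    synthetic = []
    for row in rows:
        for _ in range(factor):
            synthetic.append({
                k: v * (1 + rng.uniform(-jitter, jitter))
                   if isinstance(v, (int, float)) else v
                for k, v in row.items()
            })
    return synthetic

sample = [{"plot": "A1", "ndvi": 0.62}, {"plot": "B2", "ndvi": 0.71}]
bigger = synthesize(sample)
print(len(bigger))  # 6 synthetic rows from 2 originals
```

Simple perturbation like this preserves each field's rough distribution, which is often enough for load-testing an algorithm; more faithful generators would also model correlations between fields.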

The next stage, after having the selected data, is the execution of algorithms. In BigDataGrapes, CNR (the National Research Council of Italy) and Agroknow (the food safety intelligence company) have developed a wide range of prediction algorithms (we will have the chance to dive into the algorithms in a future post). In this stage, data scientists can select through an Intelligence API which prediction or correlation algorithm they want to execute.
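Selecting an algorithm by name through an API typically boils down to a dispatcher over a registry of implementations. The registry keys and the toy algorithms below are hypothetical placeholders; the real prediction algorithms are the subject of a future post.

```python
# Hedged sketch of an Intelligence-API-style algorithm dispatcher.
# The registered "algorithms" are trivial stand-ins for illustration.
ALGORITHMS = {
    "mean_predict": lambda ys: sum(ys) / len(ys),
    "max_predict": lambda ys: max(ys),
}

def run_algorithm(name, series):
    """Select and execute a registered algorithm by name."""
    try:
        return ALGORITHMS[name](series)
    except KeyError:
        raise ValueError(f"unknown algorithm: {name}")

print(run_algorithm("mean_predict", [2, 4, 6]))  # 4.0
print(run_algorithm("max_predict", [1, 9, 3]))   # 9
```

In the platform this selection happens over HTTP, with the algorithm name as a request parameter, but the server-side shape is the same: look up the requested algorithm, run it on the selected data, return the result.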

The final stage of the data scientist's journey within the platform is viewing the results of the algorithm execution. Data visualization is key here, and the team at KU Leuven has developed interactive visualization components and dashboards where users can see the results of the algorithm execution process.