ASU CASCADE and SRP Joint Research Project

   One of the biggest challenges large organizations face today is their inability to use data coming from heterogeneous sources. These large entities use software from different vendors, which often have different data representations. To get unified access, organizations need metadata that reconciles the data spread over these unaligned sources. Arizona State University's Center for Assured Scalable Data Engineering (CASCADE) partnered with the Salt River Project (SRP) to provide a vendor-agnostic Transmission System Model to organize and maintain data across SRP's organization. The principal investigator for the project was Prof. Dragan Boscovic. The team was mentored by Prof. Mohamed Sarwat and comprised two Ph.D. students, Vamsi Meduri and Manjusha Ravindranath.

   SRP has been concerned with inconsistency in the way they look at their data. The goal of the project was to consolidate different views so that different groups will have consistency when using the organization's data. The project was organized into six phases: Analysis of sources, Dictionary creation at the schema level, Exchange, Dictionary creation at the data level, Code writing, and API/Querying/Application/Documentation.

  1. Analysis of sources: In the first phase, the team developed a deep understanding of the target systems, which required analytical insight into the databases. The risk associated with phase one was a lack of clarity about the data stored on the source and target systems.

  2. Dictionary creation at the schema level (Mapping): The second phase was intended to align the data sources of interest in terms of their own schemas. The risk associated with phase two was the need for human interaction, because the mapping required high precision and coverage.

  3. Exchange: Phase three was the extraction, transformation, and importation of the data into the new database. The risk was that incompatibility between technologies would increase complexity and effort.

  4. Dictionary creation at the data level: Phase four created a data dictionary through automation, using algorithms that linked heterogeneous entities. The risk associated with phase four came from the lack of token information, which leads to errors due to inconsistencies.

  5. Code Writing: Once a unified view was created, phase five updated the codebase and stored procedures. Existing code can be difficult to modify, so the risk was that the work would require human interaction and interviews with the original authors.

  6. API/Querying/Application and Documentation (this stage has yet to be completed): The final stage makes the data available to applications and for regular querying. It requires help from end users and developers to reduce the risk of building tools that won't be used in the organization.
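
The article does not describe the team's actual algorithms, so the following is only a minimal sketch of what phases two and four could look like. The source names, column names, equipment names, the `SCHEMA_MAP` dictionary, and the choice of Jaccard token similarity are all assumptions made for illustration, not details of the SRP project.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical schema-level dictionary (phase two):
# (source system, source column) -> column name in the unified schema.
SCHEMA_MAP: Dict[Tuple[str, str], str] = {
    ("vendor_a", "SUBSTN_NM"): "substation_name",
    ("vendor_b", "station"): "substation_name",
}


def to_unified(source: str, row: Dict[str, str]) -> Dict[str, str]:
    """Rename a row's columns to the unified schema, dropping unmapped columns."""
    return {
        SCHEMA_MAP[(source, col)]: value
        for col, value in row.items()
        if (source, col) in SCHEMA_MAP
    }


def tokens(name: str) -> Set[str]:
    """Lowercase a name and split it on non-alphanumeric characters."""
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if t}


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; 0.0 if either name has no tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_entities(
    left: List[str], right: List[str], threshold: float = 0.5
) -> List[Tuple[str, str]]:
    """Data-level dictionary (phase four): pair each left name with its
    most similar right name, keeping pairs at or above the threshold."""
    pairs = []
    for name in left:
        best = max(right, key=lambda r: jaccard(name, r), default=None)
        if best is not None and jaccard(name, best) >= threshold:
            pairs.append((name, best))
    return pairs


# Made-up equipment names from two hypothetical vendor systems.
left_names = ["Alpha 230kV Sub", "Beta"]
right_names = ["ALPHA SUB 230KV", "Beta Station", "Gamma"]
matches = match_entities(left_names, right_names)
```

A token-based score like this depends on the names carrying enough tokens to compare, which is one way to read the "lack of token information" risk noted for phase four.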

 

     Across the six phases, the greatest risk the team faced was its lack of formal training in electrical terminology and of a basic understanding of the field. The team at ASU had to communicate constantly with SRP to make sure that it was building tools the organization could use. For anyone from Computer Science or Information Systems who aspires to work on an applied project in a different field such as power systems, a basic knowledge of the electrical terminology is essential and gives a quick start, but spending too much time on the nitty-gritty of electrical systems is not advisable. Instead, it saves time for each team member to stick to their strengths and to fill in the missing background by communicating constantly with team members who have that expertise. The team at ASU focused on its strength, expertise in databases and data integration, to tackle this applied research problem, while the group at SRP explained the electrical-systems context and the semantic meaning of the data, and provided feedback by verifying and manually annotating the results of automatic reconciliation. The manual labels provided by SRP at the end of each phase of the data integration pipeline improved the accuracy of the entity matching functions manifold. This project is a testament to how effectively human feedback improves automatic data integration tasks, and it showcases the importance of human-database interaction.
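
The article does not say how SRP's manual labels were fed back into the matching functions. One common way to use such labels, shown here purely as an assumed illustration, is to tune the similarity threshold of a matcher so that it maximizes F1 on the human-labeled pairs. The scores and labels below are invented.

```python
from typing import List, Tuple

# Each item is (similarity score from an automatic matcher,
#               human label: True if the pair is a real match).
Labeled = List[Tuple[float, bool]]


def f1_at(threshold: float, labeled: Labeled) -> float:
    """F1 score obtained by accepting every pair scored >= threshold."""
    tp = sum(1 for s, y in labeled if s >= threshold and y)
    fp = sum(1 for s, y in labeled if s >= threshold and not y)
    fn = sum(1 for s, y in labeled if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labeled: Labeled) -> float:
    """Try each observed score as a threshold and keep the best by F1."""
    candidates = sorted({s for s, _ in labeled})
    return max(candidates, key=lambda t: f1_at(t, labeled))


# Invented feedback: five candidate pairs reviewed by a domain expert.
feedback: Labeled = [
    (0.9, True),
    (0.7, True),
    (0.6, False),
    (0.4, True),
    (0.2, False),
]
chosen = best_threshold(feedback)
```

In this toy data the expert confirms a low-scoring pair as a true match, so the tuned threshold drops to admit it, which is the kind of correction an automatic matcher cannot make without domain feedback.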

 

   The team believes that while a strong foundation in the concepts of data integration, schema mapping, and entity matching is required for this project, that knowledge is not directly transferable to an applied industrial project. Working on the SRP project required applying knowledge from research papers and textbooks to the real world, where the application scenario is not artificial or synthetic as it is in textbooks. However, the team believes that the difficulty of real-world problem solving is the fun part of working on applied research projects. Significant progress in this project would have been impossible without the participation of all the team members and the feedback and continuous help from SRP to fill in the missing semantic knowledge about electrical systems. The involvement of the two teams (at ASU and SRP) was complementary, and this contributed heavily to the progress made so far.

   The project resulted in the creation of a unified database schema for SRP that is free from inconsistencies. Now, any application built on top of this schema will reference only the unified database that the team created, which will give a consistent view of the organization's data. The team's next tasks are to take existing applications and modify them to run on the unified database. A follow-up project in the Fall semester of 2018 will focus on the equipment data, which is more dynamic and ever-changing than the static location model. The team at ASU looks forward to collaborating with SRP to take up the next, bigger challenge.


Original Project Proposal Link


CASCADE R&D is supported by funds from several funding agencies, including NSF and DOE, as well as various industrial partners.