The following is a list of projects completed by the research group of Prof. Michael Gertz, both at Heidelberg University (since 2008) and at the University of California, Davis (1997-2008).
PhD Graduate School CrowdAnalyser: Spatio-temporal Analysis of User-generated Content (U Heidelberg)
A key characteristic of Web 2.0 is that data is
voluntarily provided by users on the Internet through portals such as
Wikipedia, YouTube, Flickr, Twitter, Blogs, OpenStreetMap, and various
social networks at an unprecedented scale and staggering rate. In
today’s information society and knowledge economy these portals provide a
valuable resource for diverse application domains. The enormous
potential of this voluntarily generated (crowdsourced) data through the
masses of volunteers (crowd) is increasingly recognized, but in many
areas, especially in science, it is not utilized to its full potential.
There are several unsolved issues that arise from these rapidly
increasing, very dynamic and highly heterogeneous data streams of
content created by users. The goal in addressing these issues is to
automatically assess and harness this new type of poorly structured data
for different application domains and, in particular, to infer new
information from it. The participating research groups in Heidelberg have done
pioneering work in these directions, especially in the context of
utilizing geographic data. The objective of the graduate school is to
develop novel methods and approaches for the quality-oriented analysis
and exploration of crowdsourced Web 2.0 data, as well as to further
improve and scale existing methods.
The goal of this NSF-funded project was to develop a practical cyberinfrastructure prototype to facilitate the study of how multiple environmental factors, including climatic variability, affect major ecosystems along an elevation gradient from coastal California to the summit of the Sierra Nevada. The central scientific question is understanding the coupling between the strength of the California upwelling system and terrestrial ecosystem carbon exchange. Additional scientific goals were to better understand how atmospheric dust is transported to Lake Tahoe and to examine carbon flux in the coastal zone as moderated by upwelling processes. The geographic context is one in which there is a diversity of ecosystems that are believed to be sensitive to climatological changes. The dispersion and complexity of the data needed to answer the scientific questions motivate the development of a state-of-the-art cyberinfrastructure to facilitate the scientific research. This cyberinfrastructure is based on the integration of access to distributed and varied data collections and data streams, semantic registration of data, models, and analysis tools, semantically aware data query mechanisms, and an orchestration system for advanced scientific workflows. Access to this cyberinfrastructure was provided through a web-based portal. Prof. Michael Gertz led this project until he moved from UC Davis to Heidelberg University.
Open source systems such as Linux and Mozilla are both innovative and popular. They are built by an informal group of volunteers working in a distributed, asynchronous manner. Communication and coordination are mediated by emails and shared repositories (containing manuals, design documents, source code, and bug reports). These repositories constitute an extensive online record of user feedback, artifact evolution, and task-related problem-solving behavior. This data is publicly available and amenable to modern data mining techniques as well as automated program analysis algorithms. Our goal is to integrate these automated analyses with social science methods to study the intrinsic relationship between community behavior and software engineering outcomes in large-scale open source projects, with a view to improving software engineering practice. We have assembled an interdisciplinary research team consisting of a software engineer, a social scientist, and a database researcher. Software systems have an enormous impact on the economy and on society; an improved understanding of the social processes underlying software development could well lead to faster development of cheaper, better software systems. Our interdisciplinary collaboration builds on existing work by explicitly connecting software engineering imperatives to the techniques of social science; thus we hope to evaluate the relevance of folklore principles such as "Conway's Law," which posits a relationship between artifact structure and community structure. Finally, this collaboration will help us formulate new, much-needed interdisciplinary pedagogy to train both undergraduate and graduate software engineers in the social aspects of software development: team building, work allocation, coordination, project management, and process improvement.
Many of today's mission-critical databases have not been designed with a particular focus on security aspects such as integrity, confidentiality, and availability. Even if security mechanisms were used during the initial design, these mechanisms are often outdated due to new requirements and applications and no longer reflect current security policies, thus leaving ways for insider misuse and intrusion. This NSF-funded research was concerned with analyzing various security aspects of mission-critical (relational) databases that are embedded in complex information system infrastructures. We proposed four complementary avenues of research: (1) models and techniques to profile the behavior of mission-critical data stored in databases, (2) algorithms to correlate (anomalous) data behavior with application/user behavior, (3) techniques to determine and model user profiles and roles from behavioral descriptions, and (4) the integration of these techniques, algorithms, and mechanisms into a security re-engineering workbench for (relational) databases. Two major themes formed the core of the proposed approaches. First, the analysis of database vulnerabilities and violations of security paradigms is data-driven, i.e., the behavior of the data is analyzed and modeled before it is correlated with users and applications. Second, we introduced the concept of an access path model to uniformly model and correlate data flow and access behavior among relations, users, and applications. This model allows security personnel to fully inspect database security aspects in complex settings in a focused, aspect- (policy-) driven fashion.
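The data-driven profiling idea in avenue (1) can be illustrated with a small sketch. The code below is not the project's actual mechanism; the function names and the simple frequency threshold are hypothetical. It builds per-user access profiles from an audit log of (user, relation) pairs and flags accesses that fall outside a user's historical behavior:

```python
from collections import Counter, defaultdict

def build_profiles(audit_log):
    """Count how often each user touched each relation in a training window.
    audit_log: iterable of (user, relation) pairs, e.g. from DB audit records."""
    profiles = defaultdict(Counter)
    for user, relation in audit_log:
        profiles[user][relation] += 1
    return profiles

def anomalous(profiles, user, relation, min_support=1):
    """Flag an access as anomalous if the user's historical profile shows
    fewer than `min_support` prior accesses to that relation."""
    return profiles.get(user, Counter())[relation] < min_support
```

A real system would of course use richer features (time of day, query shape, data values touched) rather than raw access counts, but the structure is the same: model data/access behavior first, then correlate deviations with users and applications.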
This project lays out our plan for research and development aimed at dramatically improving the process and
outcome of scientific data analysis and visualization. The improvement
will be achieved by coupling an expressive and extensible metadata
management framework with novel visualization interfaces that facilitate effective reuse, sharing, and cross-exploration of visualization information and thus will make a profound impact
on a broad range of scientific applications. The process of scientific
visualization is inherently iterative. A good visualization comes from
experimenting with visualization and rendering parameters to bring out
the most relevant information in the data. This raises a question: considering the computer and human time we routinely invest in exploratory and production visualization, are there methodologies and mechanisms to enhance not only the productivity of scientists but also their understanding of the visualization process and the data used?
Recent advances in the field of data visualization have been made mainly in rendering and display technologies (such as real-time volume rendering and immersive environments), but little progress has been made in coherently managing, representing, and sharing information about the visualization process and results (images and insights). Naturally, information about data exploration should be shared and reused to leverage the knowledge and experience scientists gain from visualizing scientific data. A visual representation of the data exploration process, along with expressive models for recording and querying task-specific information, helps scientists keep track of their visualization experience and findings, use it to generate new visualizations, and share it with others.
While previous research has addressed some related issues, a more comprehensive study remains to be done. Thus, we propose two complementary avenues of research: (1) new user interfaces for data visualization tasks, and (2) expressive metadata models supporting the recording and querying of information related to data exploration tasks. In addition, a set of user studies will be conducted on a web-based visualization testbed realizing (1) and (2) in order to refine the proposed methodologies and designs. Traditional user interfaces cannot support the increasingly complex process of scientific data exploration. A fundamental change in conventional designs and functionality must be made to offer more intuitive interaction, guidance, and enhanced perception. We will begin our study by enriching the graph-based and spreadsheet-like interfaces we have developed, and also investigate alternative designs. An expressive and extensible metadata model representing the data exploration process and its embedded data visualization process is needed. Such a model, along with an appropriate user interface, makes it possible to manage diverse information about the input and results of the visualization process, analyze parameter coverage and usage, identify unexplored visualization spaces, and incorporate findings on the process and results in the form of visualization metadata. The model is independent of the actual visual interface used and is open in that its realization in the form of a metadata repository can be loosely coupled with a variety of different visualization tools. A set of interfaces and protocols to the repository will be designed to manage, query, and analyze visualization metadata gathered from and utilized by different visualization tools.
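To make the notion of a visualization metadata repository concrete, here is a minimal sketch. It is illustrative only, with hypothetical names and a deliberately simplified structure rather than the project's actual model: it records visualization runs (dataset, parameters, output, notes) and answers one of the queries mentioned above, finding parameter combinations not yet explored for a dataset:

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class VisRun:
    dataset: str
    params: dict          # e.g. {"iso": 0.4, "cmap": "fire"}
    image: str            # path or id of the rendered result
    notes: str = ""       # scientist's recorded insight

class VisRepository:
    """Minimal metadata repository: record runs, query parameter coverage."""
    def __init__(self):
        self.runs = []

    def record(self, run: VisRun):
        self.runs.append(run)

    def unexplored(self, dataset, param_space):
        """Return parameter combinations from `param_space` (a dict of
        parameter name -> candidate values) not yet tried on `dataset`."""
        tried = {tuple(sorted(r.params.items()))
                 for r in self.runs if r.dataset == dataset}
        names = sorted(param_space)
        combos = [dict(zip(names, vals))
                  for vals in product(*(param_space[n] for n in names))]
        return [c for c in combos
                if tuple(sorted(c.items())) not in tried]
```

Because the repository only stores metadata, a sketch like this can in principle be loosely coupled with any visualization tool that reports its parameters and outputs.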
Our society increasingly relies on prompt, accurate delivery of
information over the Internet. Users need assurance that the
information they get in this way is authentic, and they need to get this
assurance in a cheap and reliable way.
The research centers on a new approach to engendering this confidence. The starting point is to separate the roles of the "owner" of a database and its "publisher" (or publishers). With this approach the user need not trust the publisher. Instead, the owner of the database provides the user with a small amount of "summary information". After that, the publisher not only answers the user's questions, but also provides, along with each answer, a short "digital certificate" of accuracy. Using the summary information, the certificate lets the user check that the information received is correct and complete. Developing and evaluating good schemes to construct these certificates is the key technical challenge.
The publisher need not maintain a trusted system, lowering its cost of doing business. The publisher can also more easily provide information from multiple owners. Overall, the approach should make it cheaper to obtain reliable data over the Internet and expand the settings in which such data is used.
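The owner/publisher/user protocol can be illustrated with a Merkle hash tree, a standard building block for such certificates (this sketch is illustrative, not necessarily the project's actual scheme). The owner's "summary information" is the tree's root hash; the publisher's "certificate" for a returned record is the authentication path of sibling hashes from that record up to the root:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(records):
    """Owner side: build a Merkle tree bottom-up over the records.
    Returns the list of levels, leaves first; the root is levels[-1][0]."""
    level = [h(r) for r in records]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]   # duplicate last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def certificate(levels, index):
    """Publisher side: authentication path for the record at `index`.
    Each entry is (sibling_hash, position), position 0 = node is left child."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        path.append((level[index ^ 1], index % 2))
        index //= 2
    return path

def verify(root, record, path):
    """User side: recompute the root from the answer and its certificate."""
    node = h(record)
    for sibling, pos in path:
        node = h(node + sibling) if pos == 0 else h(sibling + node)
    return node == root
```

The certificate is short (logarithmic in the database size), and since forging it would require finding a hash collision, the user can accept answers from an untrusted publisher while trusting only the owner's root hash.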