PDF Data Sources Integration Project
Challenge
The client asked GGA to continue developing a system that brings together medical news, communities, research, and other sources from across the web.
Solution
GGA developed a solution that integrates with PDF data sources to extract, recognize, and parse their data. The application provides a rich user interface for searching scientific articles and lets users view the content of PDF documents on a web page.
The solution has the following layers:
- Document storage: A dedicated store to which an authorized user can add new documents over HTTP or WebDAV. The document storage is integrated with LDAP so that user permissions are kept in a single place.
- Document processing layer: Supports a sophisticated workflow that extracts all necessary data according to the specification:
  - Extract the text for the index.
  - Recognize document metadata, images, chemical information, tables, authors, and abstracts in scientific publications.
  - Extract keywords and update the ontology database.
  - Find similar documents in the current database.
  - Prepare static content to show the document on a web page.
- Database and search index: The solution uses a custom index built on the Lucene full-text search library. The index was modified to support the solution's ontology database and to score search results according to the client's requirements.
- Web layer: Developed with Ruby on Rails, jQuery, HTML, CSS, and JavaScript, and tested with RSpec. The solution was deployed to the Amazon cloud; cookbooks were used to provision new instances and to maintain the solution.
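The processing workflow described above can be illustrated with a small Ruby sketch. It is not the project's code: it assumes the PDF text has already been extracted to a plain string, uses simple word frequency as a stand-in for keyword extraction, and uses Jaccard overlap of keywords as a stand-in for similar-document detection. All names (`index_text`, `extract_keywords`, `similar_documents`, `process_document`), the stop-word list, and the threshold are hypothetical. It needs Ruby 2.7 or later for `tally`.

```ruby
require "set"

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = Set.new(%w[a an and are as at by for in is of on or the to with]).freeze

# Step 1: normalize already-extracted text for indexing.
# (Real PDF text extraction is out of scope; the input is plain text.)
def index_text(raw)
  raw.gsub(/\s+/, " ").strip
end

# Step 2: take the most frequent non-stop-words as keywords.
# Ties are broken alphabetically so the result is deterministic.
def extract_keywords(text, limit: 5)
  words = text.downcase.scan(/[a-z]{3,}/).reject { |w| STOP_WORDS.include?(w) }
  words.tally.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
end

# Jaccard similarity between two keyword lists (0.0 to 1.0).
def jaccard(a, b)
  a, b = a.to_set, b.to_set
  union = a | b
  union.empty? ? 0.0 : (a & b).size.to_f / union.size
end

# Step 3: find stored documents whose keywords overlap enough.
# The threshold is an assumed value, not one taken from the project.
def similar_documents(keywords, database, threshold: 0.25)
  database.select { |doc| jaccard(keywords, doc[:keywords]) >= threshold }
          .map { |doc| doc[:id] }
end

# Run the steps in order and return one record for the document.
def process_document(id, raw_text, database)
  text = index_text(raw_text)
  keywords = extract_keywords(text)
  { id: id, text: text, keywords: keywords,
    similar: similar_documents(keywords, database) }
end

# Usage with a tiny in-memory "database" of previously processed documents.
database = [
  { id: "doc-1", keywords: %w[aspirin trial dosage] },
  { id: "doc-2", keywords: %w[genome sequencing] }
]
record = process_document("doc-3", "Aspirin dosage in a trial.\nAspirin   trial results.", database)
puts record[:keywords].inspect
puts record[:similar].inspect
```

Keeping each step as a separate function mirrors the layered workflow: a step can be replaced (for example, by a real metadata recognizer) without changing the others.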
Technologies
- Ruby on Rails, Java, SQL, CVS, Amazon, AJAX, JS, Lucene.
Features
- Finds documents in different private data sources.
- Searches in archives containing PDF files from different vendors.
- Processes documents so that their contents can be displayed on the web.
- Business benefits:
  - Easy-to-use search interface in which the user can review search results from several perspectives; each document result includes snippets of relevant text, images related to the entered keywords, and chemistry.
  - Simple and fast bulk upload engine (two main protocols are supported: REST and WebDAV).
  - Ontology database to provide rich search results.
- Technical benefits:
  - Scalable and fast processing engine.
  - Fast search engine.
  - Any component can be scaled using the Amazon cloud.
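How an ontology can enrich search results may be easier to see in a small example. The Ruby sketch below is not the project's Lucene-based index; it is a plain in-memory illustration of one common approach, query expansion, in which a search term is widened with related ontology terms that carry a lower weight. The ontology contents, the `0.5` weight, and the function names (`expand_query`, `score`, `search`) are all assumptions made for illustration.

```ruby
# Hypothetical ontology: each term maps to related terms.
ONTOLOGY = {
  "aspirin" => ["acetylsalicylic acid", "nsaid"],
  "heart attack" => ["myocardial infarction"]
}.freeze

# Expand a query term into weighted terms: the original term gets full
# weight, ontology-related terms get a reduced weight (assumed value).
def expand_query(term, related_weight: 0.5)
  key = term.downcase
  expanded = { key => 1.0 }
  ONTOLOGY.fetch(key, []).each { |related| expanded[related] ||= related_weight }
  expanded
end

# Score a document as the sum of the weights of the expanded terms
# that occur in its text (simple case-insensitive substring match).
def score(document_text, weighted_terms)
  text = document_text.downcase
  weighted_terms.sum { |term, weight| text.include?(term) ? weight : 0.0 }
end

# Rank documents by score (highest first), dropping those that match nothing.
def search(term, documents)
  weighted = expand_query(term)
  documents.map { |id, text| [id, score(text, weighted)] }
           .select { |_, s| s.positive? }
           .sort_by { |id, s| [-s, id] }
end

# Usage: "b" never mentions aspirin but is still found through the ontology.
documents = {
  "a" => "Aspirin reduces risk.",
  "b" => "An NSAID comparison study.",
  "c" => "Genome sequencing."
}
search("Aspirin", documents).each { |id, s| puts "#{id}: #{s}" }
```

A direct match ranks above an ontology-only match because related terms carry a lower weight; a production index would apply the same idea inside the search engine's own scoring rather than with substring checks.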