Research Project
My first project was the design of a data system in Java. The task was to accept, import, and store Affymetrix data produced by the Microarray Suite (MAS) software, specifically the MAS 5 tools. The system works as follows: acquire the files; parse and convert XDA or legacy binary files to a new ASCII format as needed; parse and import the result into new data structures; generate sqlloader files; and finally call sqlloader to import the data into the existing dbZach. Back-end additions were also made to dbZach to allow storage of the new data. Because each import can exceed a million entries, the system needed efficient data structures and a fast bulk-loading tool such as sqlloader. Finally, the loaded data had to be linked from the Affy back-end to the dbZach back-end, which required parsing Affy's accession translation files and loading them into the database as well.
My second project was analyzing and refactoring the existing RealTime PCR subsystem of dbZach. This involved heavy analysis of past work, after which both the code and the Oracle stored procedures were refactored.
My third project was a word/gene expression analysis project: given a set of genes we saw to be overexpressed in mice, are there words/motifs common to these genes that are not found in the rest of the mouse genome? This was a collaborative project among Lyle Burgoon, Rahul Sarkar, and me. The steps involved were 1) web crawling, 2) sequence insertion, 3) sequence extraction/word generation, 4) word insertion, 5) word count generation, and 6) statistical analysis. My job spanned tasks 1 and 3-5. These tasks required heavy database interaction and storage (1.3 billion records) as well as careful data structure management.
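The word generation and word count steps (3 and 5) can be illustrated with a minimal sketch. It assumes a "word" is every overlapping substring of a fixed length k in a sequence; the class name, method name, and example sequence are hypothetical, and the real project stored its counts in the database rather than in memory.

```java
import java.util.HashMap;
import java.util.Map;

final class WordCounter {

    /** Counts every overlapping word of length k in the given sequence. */
    static Map<String, Integer> countWords(String sequence, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("word length must be at least 1");
        }
        Map<String, Integer> counts = new HashMap<>();
        String seq = sequence.toUpperCase();
        // Slide a window of width k across the sequence, one base at a time.
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy sequence for illustration only.
        System.out.println(countWords("GCGTGCGTG", 5));
    }
}
```

An in-memory map like this only works for small inputs; at the scale described above, the counts would have to be aggregated in the database or in batches.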
My fourth project was the data visualization project. This Java application was designed to be the most extensible application to date at the Zacharewski lab. It is a core application that reads data from a file, lets the user filter and truncate the data as needed, and then passes the result on to a plug-in application. My job was the core application, which involved designing the back-end logic as well as the front-end GUI. The first plug-in (3D visualization) was done by Rahul Sarkar.
Along with these projects, I performed numerous number-crunching jobs for Jeremy Burt as well as IT administration. I was also trained in, and performed, PCR and gel electrophoresis.
TERM TWO:
This term I am working on six projects. The first is a collaboration with Lyle Burgoon, in which I retrieved ~91,000 sequences (-10K to 5' UTR and 3' UTR) for human, mouse, and rat and generated 100 gigabytes of sequence words, which were then re-entered into the Oracle database. These words are used to search for novel common word sequences from laboratory assays.
The second project is a correlation tool that was initially created by Stacy Hung. I generalized and extended the code to allow flexible usage as well as robust filtering.
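As an illustration of the kind of computation such a tool performs, here is a minimal sketch of a correlation coefficient between two data vectors. The text does not say which correlation measure the tool uses; Pearson's coefficient is assumed here, and the class and method names are hypothetical.

```java
final class Correlation {

    /**
     * Pearson correlation coefficient of two equal-length vectors.
     * Returns NaN if either vector has zero variance.
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two vectors of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Accumulate the co-deviation and the two squared deviations.
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // Toy vectors for illustration only.
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {2, 4, 6}));
    }
}
```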
The third project is a principal component analysis (PCA) tool that implements several algorithms, including the classic algebraic approach and singular value decomposition (SVD). Some of the statistical computation is provided by the Colt package. The aim of this project is to develop a user-friendly yet powerful analysis tool that is not limited to specific laboratory data.
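To make the idea concrete, here is a minimal sketch that finds only the first principal component. It is not the tool's implementation: it uses plain Java rather than Colt, and it swaps in power iteration on the covariance matrix in place of a full eigendecomposition or SVD. The class name, method name, and sample data are hypothetical.

```java
final class FirstPrincipalComponent {

    /** Returns a unit vector along the direction of greatest variance in the rows of data. */
    static double[] firstComponent(double[][] data, int iterations) {
        int n = data.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two observations");
        }
        int d = data[0].length;

        // Column means, used to center the data.
        double[] mean = new double[d];
        for (double[] row : data) {
            for (int j = 0; j < d; j++) mean[j] += row[j] / n;
        }

        // Sample covariance matrix of the centered data.
        double[][] cov = new double[d][d];
        for (double[] row : data) {
            for (int a = 0; a < d; a++) {
                for (int b = 0; b < d; b++) {
                    cov[a][b] += (row[a] - mean[a]) * (row[b] - mean[b]) / (n - 1);
                }
            }
        }

        // Arbitrary normalized start vector. Power iteration fails if the start
        // happens to be orthogonal to the leading eigenvector.
        double[] v = new double[d];
        double startNorm = 0;
        for (int j = 0; j < d; j++) {
            v[j] = j + 1;
            startNorm += v[j] * v[j];
        }
        startNorm = Math.sqrt(startNorm);
        for (int j = 0; j < d; j++) v[j] /= startNorm;

        // Repeatedly multiply by the covariance matrix and renormalize.
        for (int it = 0; it < iterations; it++) {
            double[] w = new double[d];
            double norm = 0;
            for (int a = 0; a < d; a++) {
                for (int b = 0; b < d; b++) w[a] += cov[a][b] * v[b];
                norm += w[a] * w[a];
            }
            norm = Math.sqrt(norm);
            if (norm == 0) break; // no variance along the current direction
            for (int a = 0; a < d; a++) v[a] = w[a] / norm;
        }
        return v;
    }

    public static void main(String[] args) {
        // Toy data lying on the line y = 2x, for illustration only.
        double[][] data = {{1, 2}, {2, 4}, {3, 6}};
        System.out.println(java.util.Arrays.toString(firstComponent(data, 50)));
    }
}
```

Power iteration is easy to follow but yields one component at a time; a library decomposition is the better choice when all components are needed.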
The fourth project is an agglomerative hierarchical clustering tool that clusters data and provides different ways to visualize the results. The tool supports various types of linkage as well as several distance metrics.
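The core loop of agglomerative clustering can be sketched as follows. This is a naive illustration, not the tool's code: it assumes single linkage and Euclidean distance (one of many possible combinations), stops when k clusters remain instead of building a full dendrogram, and uses hypothetical class and method names.

```java
import java.util.ArrayList;
import java.util.List;

final class SingleLinkage {

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Single linkage: the smallest distance between any pair of members. */
    static double linkage(List<Integer> c1, List<Integer> c2, double[][] points) {
        double best = Double.POSITIVE_INFINITY;
        for (int i : c1) {
            for (int j : c2) best = Math.min(best, euclidean(points[i], points[j]));
        }
        return best;
    }

    /** Merges the two closest clusters until only k remain; clusters hold point indices. */
    static List<List<Integer>> cluster(double[][] points, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        // Start with every point in its own cluster.
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double dist = linkage(clusters.get(a), clusters.get(b), points);
                    if (dist < bestDist) {
                        bestDist = dist;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            // bestB > bestA, so removing bestB leaves bestA's index unchanged.
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Toy one-dimensional points for illustration only.
        double[][] points = {{0}, {1}, {10}, {11}};
        System.out.println(cluster(points, 2));
    }
}
```

Swapping the body of `linkage` for a maximum or an average gives complete or average linkage, and `euclidean` can be replaced by any other distance function.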
The fifth project is a software tool that creates a virtual pipeline between dbZach and EBI's ArrayExpress. This code will use MAGEstk and adhere to the MGED standards for MAGE-ML/MAGE-OM. The output from our database is XML that ArrayExpress accepts under the MAGE-ML DTD.
The last project involves reanalyzing work done by a previous post-doc, Yan Sun, on dioxin response elements (DREs). This work involves creating position weight matrices that allow DREs to be identified within sequences.
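A position weight matrix can be sketched in a few lines. The example below is an illustration under stated assumptions, not the project's method: it builds a log-odds matrix from aligned sites with a pseudocount and a uniform background of 0.25 per base, accepts only uppercase A, C, G, and T, and uses hypothetical names throughout. The training sites are made-up variants around the DRE core GCGTG, not experimental data.

```java
final class PositionWeightMatrix {

    private static final String BASES = "ACGT";
    private final double[][] logOdds; // indexed [position][base]

    /** Builds log2-odds scores from equal-length aligned sites (uppercase ACGT only). */
    PositionWeightMatrix(String[] alignedSites, double pseudocount) {
        int width = alignedSites[0].length();
        logOdds = new double[width][4];
        double total = alignedSites.length + 4 * pseudocount;
        for (int pos = 0; pos < width; pos++) {
            double[] counts = {pseudocount, pseudocount, pseudocount, pseudocount};
            for (String site : alignedSites) {
                counts[BASES.indexOf(site.charAt(pos))]++;
            }
            for (int b = 0; b < 4; b++) {
                // Observed frequency relative to a uniform background of 0.25.
                logOdds[pos][b] = Math.log((counts[b] / total) / 0.25) / Math.log(2);
            }
        }
    }

    int width() {
        return logOdds.length;
    }

    /** Sums the per-position scores for a window exactly as wide as the matrix. */
    double score(String window) {
        double sum = 0;
        for (int pos = 0; pos < logOdds.length; pos++) {
            sum += logOdds[pos][BASES.indexOf(window.charAt(pos))];
        }
        return sum;
    }

    /** Returns the start index of the highest-scoring window, or -1 if the sequence is too short. */
    int bestMatch(String sequence) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i + width() <= sequence.length(); i++) {
            double s = score(sequence.substring(i, i + width()));
            if (s > bestScore) {
                bestScore = s;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up training sites for illustration only.
        String[] sites = {"GCGTG", "GCGTG", "GCGTA", "TCGTG"};
        PositionWeightMatrix pwm = new PositionWeightMatrix(sites, 0.5);
        System.out.println(pwm.bestMatch("AAGCGTGAA"));
    }
}
```

In practice a score threshold, rather than a single best window, decides which positions are reported as putative elements.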