{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Specifying data for analysis\n", "\n", "We introduce the concept of a \"data store\". This represents the data record(s) that you want to analyse. It can be a single file, a directory of files, a zipped directory of files or a single `tinydb` file containing multiple data records.\n", "\n", "We represent this concept by a `DataStore` class. There are different flavours of these:\n", "\n", "- directory based\n", "- zip archive based\n", "- TinyDB based (this is a NoSQL json based data base)\n", "\n", "These can be read only or writable. All of these types support being indexed, iterated over, filtered, etc.. The `tinydb` variants do have some unique abilities (discussed below)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A read only data store\n", "\n", "To create one of these, you provide a `path` AND a `suffix` of the files within the directory / zip that you will be analysing. (If the path ends with `.tinydb`, no file suffix is required.)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5x member ReadOnlyZippedDataStore(source='../data/raw.zip', members=['../data/raw/ENSG00000157184.fa', '../data/raw/ENSG00000131791.fa', '../data/raw/ENSG00000127054.fa'...)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from cogent3.app.io import get_data_store\n", "\n", "dstore = get_data_store(\"../data/raw.zip\", suffix=\"fa*\", limit=5)\n", "dstore" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data store \"members\"\n", "\n", "These are able to read their own raw data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'../data/raw/ENSG00000157184.fa'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m = dstore[0]\n", "m" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'>Human\\nATGGTGCCCCGCC'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m.read()[:20] # truncating" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Showing the last few members\n", "\n", "Use the `head()` method to see the first few." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['../data/raw/ENSG00000157184.fa',\n", " '../data/raw/ENSG00000131791.fa',\n", " '../data/raw/ENSG00000127054.fa',\n", " '../data/raw/ENSG00000067704.fa',\n", " '../data/raw/ENSG00000182004.fa']\n" ] } ], "source": [ "dstore.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering a data store for specific members" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['../data/raw/ENSG00000067704.fa']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dstore.filtered(\"*ENSG00000067704*\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looping over a data store" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "../data/raw/ENSG00000157184.fa\n", "../data/raw/ENSG00000131791.fa\n", "../data/raw/ENSG00000127054.fa\n", "../data/raw/ENSG00000067704.fa\n", "../data/raw/ENSG00000182004.fa\n" ] } ], "source": [ "for m in dstore:\n", " print(m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making a writeable data store\n", "\n", "The creation of a writeable data store is handled for you by the different writers we provide under `cogent3.app.io`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TinyDB data stores are special\n", "\n", "When you specify a TinyDB data store as your output (by using `io.write_db()`), you get additional features that are useful for dissecting the results of an analysis. \n", "\n", "One important issue to note is the process which creates a TinyDB \"locks\" the file. If that process exits unnaturally (e.g. the run that was producing it was interrupted) then the file may remain in a locked state. If the db is in this state, `cogent3` will not modify it unless you explicitly unlock it.\n", "\n", "This is represented in the display as shown below." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
" ], "text/plain": [ "Locked db store. Locked to pid=8582, current pid=13845\n", "=====================\n", "record type number\n", "---------------------\n", " completed 175\n", " incomplete 0\n", " logs 1\n", "---------------------\n", "\n", "3 rows x 2 columns" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dstore = get_data_store(\"../data/demo-locked.tinydb\")\n", "dstore.describe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To unlock, you execute the following:\n", "\n", "```python\n", "dstore.unlock(force=True)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interrogating run logs\n", "\n", "If you use the `apply_to(logger=true)` method, a `scitrack` logfile will be included in the data store. This includes useful information regarding the run conditions that produced the contents of the data store." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
" ], "text/plain": [ "summary of log files\n", "====================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================\n", " time name python version who command composable\n", "--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n", "2019-07-24 14:42:56 load_unaligned-progressive_align-write_db-pid8650.log 3.7.3 gavin /Users/gavin/miniconda3/envs/c3dev/lib/python3.7/site-packages/ipykernel_launcher.py -f /Users/gavin/Library/Jupyter/runtime/kernel-5eb93aeb-f6e0-493e-85d1-d62895201ae2.json load_unaligned(type='sequences', moltype='dna', format='fasta') + progressive_align(type='sequences', model='HKY85', gc=None, param_vals={'kappa': 3}, guide_tree=None, unique_guides=False, indel_length=0.1, indel_rate=1e-10) + write_db(type='output', data_path='../data/aligned-nt.tinydb', name_callback=None, create=True, if_exists='overwrite', suffix='json')\n", "--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n", "\n", "1 rows x 6 columns" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dstore.summary_logs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Log files can be accessed vial a special attribute." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['load_unaligned-progressive_align-write_db-pid8650.log']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dstore.logs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each element in that list is a `DataStoreMember` wghich you can use to get the data contents." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2019-07-24 14:42:56\tEratosthenes.local:8650\tINFO\tsystem_details : system=Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64\n", "2019-07-24 14:42:56\tEratosthenes.local:8650\tINFO\tpython\n" ] } ], "source": [ "print(dstore.logs[0].read()[:225]) # truncated for clarity" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:c3dev] *", "language": "python", "name": "conda-env-c3dev-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.1" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }