Specifying data for analysis

We introduce the concept of a “data store”. This represents the data record(s) that you want to analyse. It can be a single file, a directory of files, a zipped directory of files or a single tinydb file containing multiple data records.

We represent this concept with a DataStore class. There are different flavours of these:

  • directory based

  • zip archive based

  • TinyDB based (a NoSQL, JSON-based database)

These can be read-only or writable. All of these types support being indexed, iterated over, filtered, etc. The TinyDB variants have some unique abilities (discussed below).

A read-only data store

To create one of these, you provide a path AND a suffix of the files within the directory / zip that you will be analysing. (If the path ends with .tinydb, no file suffix is required.)

[1]:
from cogent3.app.io import get_data_store

dstore = get_data_store("../data/raw.zip", suffix="fa*", limit=5)
dstore
[1]:
5x member ReadOnlyZippedDataStore(source='../data/raw.zip', members=['../data/raw/ENSG00000157184.fa', '../data/raw/ENSG00000131791.fa', '../data/raw/ENSG00000127054.fa'...)

Data store “members”

These are able to read their own raw data.

[2]:
m = dstore[0]
m
[2]:
'../data/raw/ENSG00000157184.fa'
[3]:
m.read()[:20]  # truncating
[3]:
'>Human\nATGGTGCCCCGCC'

Showing the last few members

Use the tail() method to see the last few members (the head() method shows the first few).

[4]:
dstore.tail()
['../data/raw/ENSG00000157184.fa',
 '../data/raw/ENSG00000131791.fa',
 '../data/raw/ENSG00000127054.fa',
 '../data/raw/ENSG00000067704.fa',
 '../data/raw/ENSG00000182004.fa']

Filtering a data store for specific members

[5]:
dstore.filtered("*ENSG00000067704*")
[5]:
['../data/raw/ENSG00000067704.fa']
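The filter pattern uses Unix shell-style globbing. As a conceptual sketch (not cogent3's implementation), the standard library's fnmatch module applies the same matching rules to plain strings:

```python
from fnmatch import fnmatch

# conceptual sketch: filtered() keeps members whose identifier matches a
# shell-style glob pattern, just as fnmatch does for plain strings
members = [
    "../data/raw/ENSG00000157184.fa",
    "../data/raw/ENSG00000067704.fa",
]
matches = [m for m in members if fnmatch(m, "*ENSG00000067704*")]
print(matches)  # only the matching member remains
```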

Looping over a data store

[6]:
for m in dstore:
    print(m)
../data/raw/ENSG00000157184.fa
../data/raw/ENSG00000131791.fa
../data/raw/ENSG00000127054.fa
../data/raw/ENSG00000067704.fa
../data/raw/ENSG00000182004.fa

Making a writeable data store

The creation of a writeable data store is handled for you by the different writers we provide under cogent3.app.io.
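Conceptually, a writer inspects the output path you give it and constructs the matching writable store. A minimal stdlib sketch of that dispatch (illustrative only, not cogent3's code):

```python
def store_type_for(path: str) -> str:
    """Sketch of how an output path could map to a data store flavour."""
    if path.endswith(".tinydb"):
        return "tinydb"
    if path.endswith(".zip"):
        return "zip"
    return "directory"

print(store_type_for("../data/aligned-nt.tinydb"))  # tinydb
```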

TinyDB data stores are special

When you specify a TinyDB data store as your output (by using io.write_db()), you get additional features that are useful for dissecting the results of an analysis.

One important issue to note: the process that creates a TinyDB data store “locks” the file. If that process exits abnormally (e.g. the run producing it was interrupted), the file may remain in a locked state. If the db is in this state, cogent3 will not modify it unless you explicitly unlock it.

This is represented in the display as shown below.

[7]:
dstore = get_data_store("../data/demo-locked.tinydb")
dstore.describe
[7]:
Locked db store. Locked to pid=8582, current pid=13845
record type number
completed 175
incomplete 0
logs 1

3 rows x 2 columns

To unlock, you execute the following:

dstore.unlock(force=True)
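The lock state reported by describe amounts to comparing the process id recorded in the db against the process currently inspecting it. A conceptual sketch (not cogent3's actual code), using the pids from the output above:

```python
# the recorded pid (8582) differs from the inspecting process (13845),
# so the store is treated as locked to another (possibly dead) process
lock_pid = 8582
current_pid = 13845
locked_to_other_process = lock_pid != current_pid
print(locked_to_other_process)  # True
```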

Interrogating run logs

If you use the apply_to(logger=True) method, a scitrack logfile will be included in the data store. This includes useful information regarding the run conditions that produced the contents of the data store.

[8]:
dstore.summary_logs
[8]:
summary of log files
time name python version who command composable
2019-07-24 14:42:56 load_unaligned-progressive_align-write_db-pid8650.log 3.7.3 gavin /Users/gavin/miniconda3/envs/c3dev/lib/python3.7/site-packages/ipykernel_launcher.py -f /Users/gavin/Library/Jupyter/runtime/kernel-5eb93aeb-f6e0-493e-85d1-d62895201ae2.json load_unaligned(type='sequences', moltype='dna', format='fasta') + progressive_align(type='sequences', model='HKY85', gc=None, param_vals={'kappa': 3}, guide_tree=None, unique_guides=False, indel_length=0.1, indel_rate=1e-10) + write_db(type='output', data_path='../data/aligned-nt.tinydb', name_callback=None, create=True, if_exists='overwrite', suffix='json')

1 rows x 6 columns

Log files can be accessed via a special attribute.

[9]:
dstore.logs
[9]:
['load_unaligned-progressive_align-write_db-pid8650.log']

Each element in that list is a DataStoreMember which you can use to get the data contents.

[10]:
print(dstore.logs[0].read()[:225])  # truncated for clarity
2019-07-24 14:42:56     Eratosthenes.local:8650 INFO    system_details : system=Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64
2019-07-24 14:42:56     Eratosthenes.local:8650 INFO    python
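Each log line carries a timestamp, a host:pid origin, a level, and a message. A hedged sketch of pulling these fields apart — the tab delimiter is an assumption inferred from the layout above, and the example line is constructed here rather than read from a real log:

```python
# a log line modelled on the output above; fields assumed tab-separated
line = (
    "2019-07-24 14:42:56\tEratosthenes.local:8650\tINFO\t"
    "system_details : system=Darwin Kernel Version 18.6.0"
)
timestamp, origin, level, message = line.split("\t")
host, pid = origin.rsplit(":", 1)
print(host, pid, level)
```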