Specifying data for analysis¶
We introduce the concept of a “data store”. This represents the data record(s) that you want to analyse. It can be a single file, a directory of files, a zipped directory of files, or a single tinydb file containing multiple data records.
We represent this concept by a DataStore class. There are different flavours of these:

- directory based
- zip archive based
- TinyDB based (a NoSQL JSON-based database)

These can be read only or writable. All of these types support being indexed, iterated over, filtered, etc. The tinydb variants have some unique abilities (discussed below).
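These core behaviours (indexing, iteration, filtering, head()/tail()) can be sketched in plain Python. The following is an illustrative sketch only, not the cogent3 implementation; the MiniDataStore class and its internals are invented for this example:

```python
import fnmatch
from pathlib import Path

class MiniDataStore:
    """Illustrative read-only, directory-based data store."""

    def __init__(self, source, suffix):
        self.source = Path(source)
        # members are just the file paths matching the suffix
        self.members = sorted(self.source.glob(f"*.{suffix}"))

    def __getitem__(self, index):
        return self.members[index]

    def __iter__(self):
        return iter(self.members)

    def __len__(self):
        return len(self.members)

    def filtered(self, pattern):
        # shell-style wildcard matching against member paths
        return [m for m in self.members if fnmatch.fnmatch(str(m), pattern)]

    def head(self, n=5):
        return self.members[:n]

    def tail(self, n=5):
        return self.members[-n:]
```

The real DataStore classes add more (lazily reading member contents, zip and TinyDB backends), but the interface is of this shape.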
A read only data store¶
To create one of these, you provide a path AND a suffix of the files within the directory / zip that you will be analysing. (If the path ends with .tinydb, no file suffix is required.)
[1]:
from cogent3.app.io import get_data_store
dstore = get_data_store("../data/raw.zip", suffix="fa*", limit=5)
dstore
[1]:
5x member ReadOnlyZippedDataStore(source='../data/raw.zip', members=['../data/raw/ENSG00000157184.fa', '../data/raw/ENSG00000131791.fa', '../data/raw/ENSG00000127054.fa'...)
Data store “members”¶
These are able to read their own raw data.
[2]:
m = dstore[0]
m
[2]:
'../data/raw/ENSG00000157184.fa'
[3]:
m.read()[:20] # truncating
[3]:
'>Human\nATGGTGCCCCGCC'
Showing the last few members¶
The tail() method shows the last few members; use head() to see the first few.
[4]:
dstore.tail()
['../data/raw/ENSG00000157184.fa',
'../data/raw/ENSG00000131791.fa',
'../data/raw/ENSG00000127054.fa',
'../data/raw/ENSG00000067704.fa',
'../data/raw/ENSG00000182004.fa']
Filtering a data store for specific members¶
[5]:
dstore.filtered("*ENSG00000067704*")
[5]:
['../data/raw/ENSG00000067704.fa']
Looping over a data store¶
[6]:
for m in dstore:
print(m)
../data/raw/ENSG00000157184.fa
../data/raw/ENSG00000131791.fa
../data/raw/ENSG00000127054.fa
../data/raw/ENSG00000067704.fa
../data/raw/ENSG00000182004.fa
Making a writeable data store¶
The creation of a writeable data store is handled for you by the different writers we provide under cogent3.app.io.
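Under a writer, the mechanics amount to appending named records to an archive and reading them back by identifier. Here is a minimal sketch of that idea using a zip archive; this is not the cogent3 API, and MiniWriteableStore is a hypothetical name:

```python
import zipfile

class MiniWriteableStore:
    """Illustrative writable, zip-archive-backed data store."""

    def __init__(self, path):
        self.path = path

    def write(self, identifier, data):
        # append a named record to the archive (creates it if absent)
        with zipfile.ZipFile(self.path, "a") as archive:
            archive.writestr(identifier, data)

    def read(self, identifier):
        # retrieve a record's raw contents by its identifier
        with zipfile.ZipFile(self.path, "r") as archive:
            return archive.read(identifier).decode("utf8")
```

In practice you never construct the store directly; a writer such as io.write_db() creates and populates it as part of a composed app.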
TinyDB data stores are special¶
When you specify a TinyDB data store as your output (by using io.write_db()
), you get additional features that are useful for dissecting the results of an analysis.
One important issue to note is that the process that creates a TinyDB “locks” the file. If that process exits abnormally (e.g. the run that was producing it was interrupted), the file may remain in a locked state. If the db is in this state, cogent3 will not modify it unless you explicitly unlock it.
This is represented in the display as shown below.
[7]:
dstore = get_data_store("../data/demo-locked.tinydb")
dstore.describe
[7]:
record type | number
---|---
completed | 175
incomplete | 0
logs | 1
3 rows x 2 columns
To unlock, you execute the following:
dstore.unlock(force=True)
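The lock is essentially a marker recording which process owns the file; a stale marker survives an interrupted run and must be forced open. A minimal sketch of these semantics (a hypothetical class, not cogent3's implementation) looks like this:

```python
from pathlib import Path

class MiniLockableStore:
    """Illustrates lock-file semantics: a writer records its pid in a
    lock file; a stale lock survives an interrupted run and must be
    forced open before the store can be modified again."""

    def __init__(self, path, pid):
        self.lockfile = Path(str(path) + ".lock")
        self.pid = pid

    def acquire(self):
        # record ownership of the file by writing our pid
        self.lockfile.write_text(str(self.pid))

    @property
    def locked(self):
        # locked if a lock file exists that belongs to another process
        return self.lockfile.exists() and self.lockfile.read_text() != str(self.pid)

    def unlock(self, force=False):
        if self.locked and not force:
            raise PermissionError("locked by another process; use force=True")
        if self.lockfile.exists():
            self.lockfile.unlink()
```

This is why unlocking a db left behind by an interrupted run requires force=True: the lock belongs to a process that no longer exists.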
Interrogating run logs¶
If you use the apply_to(logger=True) method, a scitrack logfile will be included in the data store. This includes useful information regarding the run conditions that produced the contents of the data store.
[8]:
dstore.summary_logs
[8]:
time | name | python version | who | command | composable |
---|---|---|---|---|---|
2019-07-24 14:42:56 | load_unaligned-progressive_align-write_db-pid8650.log | 3.7.3 | gavin | /Users/gavin/miniconda3/envs/c3dev/lib/python3.7/site-packages/ipykernel_launcher.py -f /Users/gavin/Library/Jupyter/runtime/kernel-5eb93aeb-f6e0-493e-85d1-d62895201ae2.json | load_unaligned(type='sequences', moltype='dna', format='fasta') + progressive_align(type='sequences', model='HKY85', gc=None, param_vals={'kappa': 3}, guide_tree=None, unique_guides=False, indel_length=0.1, indel_rate=1e-10) + write_db(type='output', data_path='../data/aligned-nt.tinydb', name_callback=None, create=True, if_exists='overwrite', suffix='json') |
1 rows x 6 columns
Log files can be accessed via a special attribute.
[9]:
dstore.logs
[9]:
['load_unaligned-progressive_align-write_db-pid8650.log']
Each element in that list is a DataStoreMember, which you can use to get the data contents.
[10]:
print(dstore.logs[0].read()[:225]) # truncated for clarity
2019-07-24 14:42:56 Eratosthenes.local:8650 INFO system_details : system=Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64
2019-07-24 14:42:56 Eratosthenes.local:8650 INFO python