Tabular data

Loading delimited formats

We load a comma separated data file using the generic LoadTable function.

>>> from cogent3 import LoadTable
>>> table = LoadTable('stats.txt', sep=',')
>>> print(table)
====================================
    Locus    Region            Ratio
------------------------------------
NP_003077       Con           2.5386
NP_004893       Con      121351.4264
NP_005079       Con     9516594.9789
NP_005500    NonCon           0.0000
NP_055852    NonCon    10933217.7090
------------------------------------

Reading large files

For really large files the automated conversion used by the standard read mechanism can be quite slow. If the data within a column is consistently of one type, set the LoadTable argument static_column_types=True. This causes the Table object to create a custom reader.

>>> table = LoadTable('stats.txt', static_column_types=True)
>>> print(table)
====================================
    Locus    Region            Ratio
------------------------------------
NP_003077       Con           2.5386
NP_004893       Con      121351.4264
NP_005079       Con     9516594.9789
NP_005500    NonCon           0.0000
NP_055852    NonCon    10933217.7090
------------------------------------

Formatting

Changing displayed numerical precision

We change the Ratio column to using scientific notation.

>>> table.format_column('Ratio', '%.1e')
>>> print(table)
==============================
    Locus    Region      Ratio
------------------------------
NP_003077       Con    2.5e+00
NP_004893       Con    1.2e+05
NP_005079       Con    9.5e+06
NP_005500    NonCon    7.4e-08
NP_055852    NonCon    1.1e+07
------------------------------

Change digits or column spacing

This can be done on table loading,

>>> table = LoadTable('stats.txt', sep=',', digits=1, space=2)
>>> print(table)
=============================
    Locus  Region       Ratio
-----------------------------
NP_003077     Con         2.5
NP_004893     Con    121351.4
NP_005079     Con   9516595.0
NP_005500  NonCon         0.0
NP_055852  NonCon  10933217.7
-----------------------------

or, for spacing at least, by modifying the attributes

>>> table.space = '    '
>>> print(table)
=================================
    Locus    Region         Ratio
---------------------------------
NP_003077       Con           2.5
NP_004893       Con      121351.4
NP_005079       Con     9516595.0
NP_005500    NonCon           0.0
NP_055852    NonCon    10933217.7
---------------------------------

Changing column headings

The table header is immutable. Changing column headings is done as follows.

>>> table = LoadTable('stats.txt', sep=',')
>>> print(table.header)
['Locus', 'Region', 'Ratio']
>>> table = table.with_new_header('Ratio', 'Stat')
>>> print(table.header)
['Locus', 'Region', 'Stat']

Creating new columns from existing ones

This can be used to take a single, or multiple columns and generate a new column of values. Here we’ll take 2 columns and return True/False based on a condition.

>>> table = LoadTable('stats.txt', sep=',')
>>> table = table.with_new_column('LargeCon',
...                     lambda r_v: r_v[0] == 'Con' and r_v[1]>10.0,
...                     columns=['Region', 'Ratio'])
>>> print(table)
================================================
    Locus    Region            Ratio    LargeCon
------------------------------------------------
NP_003077       Con           2.5386       False
NP_004893       Con      121351.4264        True
NP_005079       Con     9516594.9789        True
NP_005500    NonCon           0.0000       False
NP_055852    NonCon    10933217.7090       False
------------------------------------------------

Appending tables

Can be done without specifying a new column. Here we simply use the same table data.

>>> table1 = LoadTable('stats.txt', sep=',')
>>> table2 = LoadTable('stats.txt', sep=',')
>>> table = table1.appended(None, table2)
>>> print(table)
====================================
    Locus    Region            Ratio
------------------------------------
NP_003077       Con           2.5386
NP_004893       Con      121351.4264
NP_005079       Con     9516594.9789
NP_005500    NonCon           0.0000
NP_055852    NonCon    10933217.7090
NP_003077       Con           2.5386
NP_004893       Con      121351.4264
NP_005079       Con     9516594.9789
NP_005500    NonCon           0.0000
NP_055852    NonCon    10933217.7090
------------------------------------

or with a new column

>>> table1.title = 'Data1'
>>> table2.title = 'Data2'
>>> table = table1.appended('Data#', table2, title='')
>>> print(table)
=============================================
Data#        Locus    Region            Ratio
---------------------------------------------
Data1    NP_003077       Con           2.5386
Data1    NP_004893       Con      121351.4264
Data1    NP_005079       Con     9516594.9789
Data1    NP_005500    NonCon           0.0000
Data1    NP_055852    NonCon    10933217.7090
Data2    NP_003077       Con           2.5386
Data2    NP_004893       Con      121351.4264
Data2    NP_005079       Con     9516594.9789
Data2    NP_005500    NonCon           0.0000
Data2    NP_055852    NonCon    10933217.7090
---------------------------------------------

Note

We assigned an empty string to title, otherwise the resulting table has the same title attribute as that of table1.

Summing a single column

>>> table = LoadTable('stats.txt', sep=',')
>>> table.summed('Ratio')
20571166.652847398

Summing multiple columns or rows - strictly numerical data

We define a strictly numerical table,

>>> all_numeric = LoadTable(header=['A', 'B', 'C'], rows=[range(3),
...                                 range(3,6), range(6,9), range(9,12)])
>>> print(all_numeric)
=============
A     B     C
-------------
0     1     2
3     4     5
6     7     8
9    10    11
-------------

and sum all columns (default condition)

>>> all_numeric.summed()
[18, 22, 26]

and all rows

>>> all_numeric.summed(col_sum=False)
[3, 12, 21, 30]

Summing multiple columns or rows with mixed non-numeric/numeric data

We define a table with mixed data, like a distance matrix.

>>> mixed = LoadTable(header=['A', 'B', 'C'], rows=[['*',1,2], [3,'*', 5],
...                                                 [6,7,'*']])
>>> print(mixed)
===========
A    B    C
-----------
*    1    2
3    *    5
6    7    *
-----------

and sum all columns (default condition), ignoring non-numerical data

>>> mixed.summed(strict=False)
[9, 8, 7]

and all rows

>>> mixed.summed(col_sum=False, strict=False)
[3, 8, 13]

Filtering table rows

We can do this by providing a reference to an external function

>>> table = LoadTable('stats.txt', sep=',')
>>> sub_table = table.filtered(lambda x: x < 10.0, columns='Ratio')
>>> print(sub_table)
=============================
    Locus    Region     Ratio
-----------------------------
NP_003077       Con    2.5386
NP_005500    NonCon    0.0000
-----------------------------

or using valid python syntax within a string, which is executed

>>> sub_table = table.filtered("Ratio < 10.0")
>>> print(sub_table)
=============================
    Locus    Region     Ratio
-----------------------------
NP_003077       Con    2.5386
NP_005500    NonCon    0.0000
-----------------------------

You can also filter for values in multiple columns

>>> sub_table = table.filtered("Ratio < 10.0 and Region == 'NonCon'")
>>> print(sub_table)
=============================
    Locus    Region     Ratio
-----------------------------
NP_005500    NonCon    0.0000
-----------------------------

Filtering table columns

We select only columns that have a sum > 20 from the all_numeric table constructed above.

>>> big_numeric = all_numeric.filtered_by_column(lambda x: sum(x)>20)
>>> print(big_numeric)
========
 B     C
--------
 1     2
 4     5
 7     8
10    11
--------

Sorting

Standard sorting

>>> table = LoadTable('stats.txt', sep=',')
>>> print(table.sorted(columns='Ratio'))
====================================
    Locus    Region            Ratio
------------------------------------
NP_005500    NonCon           0.0000
NP_003077       Con           2.5386
NP_004893       Con      121351.4264
NP_005079       Con     9516594.9789
NP_055852    NonCon    10933217.7090
------------------------------------

Reverse sorting

>>> print(table.sorted(columns='Ratio', reverse='Ratio'))
====================================
    Locus    Region            Ratio
------------------------------------
NP_055852    NonCon    10933217.7090
NP_005079       Con     9516594.9789
NP_004893       Con      121351.4264
NP_003077       Con           2.5386
NP_005500    NonCon           0.0000
------------------------------------

Sorting involving multiple columns, one reversed

>>> print(table.sorted(columns=['Region', 'Ratio'], reverse='Ratio'))
====================================
    Locus    Region            Ratio
------------------------------------
NP_005079       Con     9516594.9789
NP_004893       Con      121351.4264
NP_003077       Con           2.5386
NP_055852    NonCon    10933217.7090
NP_005500    NonCon           0.0000
------------------------------------

Getting raw data

For a single column

>>> table = LoadTable('stats.txt', sep=',')
>>> raw = table.tolist('Region')
>>> print(raw)
['Con', 'Con', 'Con', 'NonCon', 'NonCon']

For multiple columns

>>> table = LoadTable('stats.txt', sep=',')
>>> raw = table.tolist(['Locus', 'Region'])
>>> print(raw)
[['NP_003077', 'Con'], ['NP_004893', 'Con'], ...

Iterating over table rows

>>> table = LoadTable('stats.txt', sep=',')
>>> for row in table:
...     print(row['Locus'])
...
NP_003077
NP_004893
NP_005079
NP_005500
NP_055852

Table slicing

Using column names

>>> table = LoadTable('stats.txt', sep=',')
>>> print(table[:2, :'Region'])
=========
    Locus
---------
NP_003077
NP_004893
---------

Using column indices

>>> table = LoadTable('stats.txt', sep=',')
>>> print(table[:2,: 1])
=========
    Locus
---------
NP_003077
NP_004893
---------

SQL-like capabilities

Distinct values

>>> table = LoadTable('stats.txt', sep=',')
>>> assert table.distinct_values('Region') == set(['NonCon', 'Con'])

Counting

>>> table = LoadTable('stats.txt', sep=',')
>>> assert table.count("Region == 'NonCon' and Ratio > 1") == 1

Joining tables

SQL-like join operations requires tables have different title attributes which are not None. We do a standard inner join here for a restricted subset. We must specify the columns that will be used for the join. Here we just use Locus but multiple columns can be used, and their names can be different between the tables. Note that the second table’s title becomes a part of the column names.

>>> rows = [['NP_004893', True], ['NP_005079', True],
...         ['NP_005500', False], ['NP_055852', False]]
>>> region_type = LoadTable(header=['Locus', 'LargeCon'], rows=rows,
...                 title='RegionClass')
>>> stats_table = LoadTable('stats.txt', sep=',', title='Stats')
>>> new = stats_table.joined(region_type, columns_self='Locus')
>>> print(new)
============================================================
    Locus    Region            Ratio    RegionClass_LargeCon
------------------------------------------------------------
NP_004893       Con      121351.4264                    True
NP_005079       Con     9516594.9789                    True
NP_005500    NonCon           0.0000                   False
NP_055852    NonCon    10933217.7090                   False
------------------------------------------------------------

Exporting

Writing delimited formats

>>> table = LoadTable('stats.txt', sep=',')
>>> table.write('stats_tab.txt', sep='\t')

Writing latex format

It is also possible to specify column alignment, table caption and other arguments.

>>> table = LoadTable('stats.txt', sep=',')
>>> print(table.tostring(format='latex'))
\begin{longtable}[htp!]{ r r r }
\hline
\bf{Locus} & \bf{Region} & \bf{Ratio} \\
\hline
\hline
NP_003077 &    Con &        2.5386 \\
NP_004893 &    Con &   121351.4264 \\
NP_005079 &    Con &  9516594.9789 \\
NP_005500 & NonCon &        0.0000 \\
NP_055852 & NonCon & 10933217.7090 \\
\hline
\end{longtable}

Writing bedGraph format

This format allows display of annotation tracks on genome browsers.

>>> rows = [['1', 100, 101, 1.123], ['1', 101, 102, 1.123],
...         ['1', 102, 103, 1.123], ['1', 103, 104, 1.123],
...         ['1', 104, 105, 1.123], ['1', 105, 106, 1.123],
...         ['1', 106, 107, 1.123], ['1', 107, 108, 1.123],
...         ['1', 108, 109, 1], ['1', 109, 110, 1],
...         ['1', 110, 111, 1], ['1', 111, 112, 1],
...         ['1', 112, 113, 1], ['1', 113, 114, 1],
...         ['1', 114, 115, 1], ['1', 115, 116, 1],
...         ['1', 116, 117, 1], ['1', 117, 118, 1],
...         ['1', 118, 119, 2], ['1', 119, 120, 2],
...         ['1', 120, 121, 2], ['1', 150, 151, 2],
...         ['1', 151, 152, 2], ['1', 152, 153, 2],
...         ['1', 153, 154, 2], ['1', 154, 155, 2],
...         ['1', 155, 156, 2], ['1', 156, 157, 2],
...         ['1', 157, 158, 2], ['1', 158, 159, 2],
...         ['1', 159, 160, 2], ['1', 160, 161, 2]]
...
>>> bgraph = LoadTable(header=['chrom', 'start', 'end', 'value'],
...                   rows=rows)
...
>>> print(bgraph.tostring(format='bedgraph', name='test track',
...     description='test of bedgraph', color=(255,0,0))) 
track type=bedGraph name="test track" description="test of bedgraph" color=255,0,0
1   100     108     1.12
1   108     118     1.0
1   118     161     2.0

The bedgraph formatter defaults to rounding values to 2 decimal places. You can adjust that precision using the ``digits`` argument.
>>> print(bgraph.tostring(format='bedgraph', name='test track',
...   description='test of bedgraph', color=(255,0,0), digits=0)) 
track type=bedGraph name="test track" description="test of bedgraph" color=255,0,0
1   100     118     1.0
1   118     161     2.0