Tabular data¶
Loading delimited formats¶
We load a comma separated data file using the generic load_table
function.
>>> from cogent3 import load_table
>>> table = load_table('stats.txt', sep=',')
>>> print(table)
====================================
Locus Region Ratio
------------------------------------
NP_003077 Con 2.5386
NP_004893 Con 121351.4264
NP_005079 Con 9516594.9789
NP_005500 NonCon 0.0000
NP_055852 NonCon 10933217.7090
------------------------------------
Reading large files¶
For really large files the automated conversion used by the standard read mechanism can be quite slow. If the data within a column is consistently of one type, set the load_table
argument static_column_types=True
. This causes the Table
object to create a custom reader.
>>> table = load_table('stats.txt', static_column_types=True, sep=",")
>>> print(table)
====================================
Locus Region Ratio
------------------------------------
NP_003077 Con 2.5386
NP_004893 Con 121351.4264
NP_005079 Con 9516594.9789
NP_005500 NonCon 0.0000
NP_055852 NonCon 10933217.7090
------------------------------------
Formatting¶
Changing displayed numerical precision¶
We change the Ratio
column to using scientific notation.
>>> table.format_column('Ratio', '%.1e')
>>> print(table)
==============================
Locus Region Ratio
------------------------------
NP_003077 Con 2.5e+00
NP_004893 Con 1.2e+05
NP_005079 Con 9.5e+06
NP_005500 NonCon 7.4e-08
NP_055852 NonCon 1.1e+07
------------------------------
Change digits or column spacing¶
This can be done on table loading,
>>> table = load_table('stats.txt', sep=',', digits=1, space=2)
>>> print(table)
=============================
Locus Region Ratio
-----------------------------
NP_003077 Con 2.5
NP_004893 Con 121351.4
NP_005079 Con 9516595.0
NP_005500 NonCon 0.0
NP_055852 NonCon 10933217.7
-----------------------------
or, for spacing at least, by modifying the attributes
>>> table.space = ' '
>>> print(table)
=================================
Locus Region Ratio
---------------------------------
NP_003077 Con 2.5
NP_004893 Con 121351.4
NP_005079 Con 9516595.0
NP_005500 NonCon 0.0
NP_055852 NonCon 10933217.7
---------------------------------
Changing column headings¶
The table header
is immutable. Changing column headings is done as follows.
>>> table = load_table('stats.txt', sep=',')
>>> print(table.header)
['Locus', 'Region', 'Ratio']
>>> table = table.with_new_header('Ratio', 'Stat')
>>> print(table.header)
['Locus', 'Region', 'Stat']
Creating new columns from existing ones¶
This can be used to take a single, or multiple columns and generate a new column of values. Here we’ll take 2 columns and return True/False based on a condition.
>>> table = load_table('stats.txt', sep=',')
>>> table = table.with_new_column('LargeCon',
... lambda r_v: r_v[0] == 'Con' and r_v[1]>10.0,
... columns=['Region', 'Ratio'])
>>> print(table)
================================================
Locus Region Ratio LargeCon
------------------------------------------------
NP_003077 Con 2.5386 False
NP_004893 Con 121351.4264 True
NP_005079 Con 9516594.9789 True
NP_005500 NonCon 0.0000 False
NP_055852 NonCon 10933217.7090 False
------------------------------------------------
Appending tables¶
Can be done without specifying a new column. Here we simply use the same table data.
>>> table1 = load_table('stats.txt', sep=',')
>>> table2 = load_table('stats.txt', sep=',')
>>> table = table1.appended(None, table2)
>>> print(table)
====================================
Locus Region Ratio
------------------------------------
NP_003077 Con 2.5386
NP_004893 Con 121351.4264
NP_005079 Con 9516594.9789
NP_005500 NonCon 0.0000
NP_055852 NonCon 10933217.7090
NP_003077 Con 2.5386
NP_004893 Con 121351.4264
NP_005079 Con 9516594.9789
NP_005500 NonCon 0.0000
NP_055852 NonCon 10933217.7090
------------------------------------
or with a new column
>>> table1.title = 'Data1'
>>> table2.title = 'Data2'
>>> table = table1.appended('Data#', table2, title='')
>>> print(table)
=============================================
Data# Locus Region Ratio
---------------------------------------------
Data1 NP_003077 Con 2.5386
Data1 NP_004893 Con 121351.4264
Data1 NP_005079 Con 9516594.9789
Data1 NP_005500 NonCon 0.0000
Data1 NP_055852 NonCon 10933217.7090
Data2 NP_003077 Con 2.5386
Data2 NP_004893 Con 121351.4264
Data2 NP_005079 Con 9516594.9789
Data2 NP_005500 NonCon 0.0000
Data2 NP_055852 NonCon 10933217.7090
---------------------------------------------
Note
We assigned an empty string to title
, otherwise the resulting table has the same title
attribute as that of table1
.
Summing a single column¶
>>> table = load_table('stats.txt', sep=',')
>>> table.summed('Ratio')
20571166.652...
Summing multiple columns or rows - strictly numerical data¶
We define a strictly numerical table,
>>> all_numeric = make_table(header=['A', 'B', 'C'], rows=[range(3),
... range(3,6), range(6,9), range(9,12)])
>>> print(all_numeric)
=============
A B C
-------------
0 1 2
3 4 5
6 7 8
9 10 11
-------------
and sum all columns (default condition)
>>> all_numeric.summed()
[18, 22, 26]
and all rows
>>> all_numeric.summed(col_sum=False)
[3, 12, 21, 30]
Summing multiple columns or rows with mixed non-numeric/numeric data¶
We define a table with mixed data, like a distance matrix.
>>> mixed = make_table(header=['A', 'B', 'C'], rows=[['*',1,2], [3,'*', 5],
... [6,7,'*']])
>>> print(mixed)
===========
A B C
-----------
* 1 2
3 * 5
6 7 *
-----------
and sum all columns (default condition), ignoring non-numerical data
>>> mixed.summed(strict=False)
[9, 8, 7]
and all rows
>>> mixed.summed(col_sum=False, strict=False)
[3, 8, 13]
Filtering table rows¶
We can do this by providing a reference to an external function
>>> table = load_table('stats.txt', sep=',')
>>> sub_table = table.filtered(lambda x: x < 10.0, columns='Ratio')
>>> print(sub_table)
=============================
Locus Region Ratio
-----------------------------
NP_003077 Con 2.5386
NP_005500 NonCon 0.0000
-----------------------------
or using valid python syntax within a string, which is executed
>>> sub_table = table.filtered("Ratio < 10.0")
>>> print(sub_table)
=============================
Locus Region Ratio
-----------------------------
NP_003077 Con 2.5386
NP_005500 NonCon 0.0000
-----------------------------
You can also filter for values in multiple columns
>>> sub_table = table.filtered("Ratio < 10.0 and Region == 'NonCon'")
>>> print(sub_table)
=============================
Locus Region Ratio
-----------------------------
NP_005500 NonCon 0.0000
-----------------------------
Filtering table columns¶
We select only columns that have a sum > 20 from the all_numeric
table constructed above.
>>> big_numeric = all_numeric.filtered_by_column(lambda x: sum(x)>20)
>>> print(big_numeric)
========
B C
--------
1 2
4 5
7 8
10 11
--------
Sorting¶
Standard sorting¶
>>> table = load_table('stats.txt', sep=',')
>>> print(table.sorted(columns='Ratio'))
====================================
Locus Region Ratio
------------------------------------
NP_005500 NonCon 0.0000
NP_003077 Con 2.5386
NP_004893 Con 121351.4264
NP_005079 Con 9516594.9789
NP_055852 NonCon 10933217.7090
------------------------------------
Reverse sorting¶
>>> print(table.sorted(columns='Ratio', reverse='Ratio'))
====================================
Locus Region Ratio
------------------------------------
NP_055852 NonCon 10933217.7090
NP_005079 Con 9516594.9789
NP_004893 Con 121351.4264
NP_003077 Con 2.5386
NP_005500 NonCon 0.0000
------------------------------------
Sorting involving multiple columns, one reversed¶
>>> print(table.sorted(columns=['Region', 'Ratio'], reverse='Ratio'))
====================================
Locus Region Ratio
------------------------------------
NP_005079 Con 9516594.9789
NP_004893 Con 121351.4264
NP_003077 Con 2.5386
NP_055852 NonCon 10933217.7090
NP_005500 NonCon 0.0000
------------------------------------
Getting raw data¶
For a single column¶
>>> table = load_table('stats.txt', sep=',')
>>> raw = table.tolist('Region')
>>> print(raw)
['Con', 'Con', 'Con', 'NonCon', 'NonCon']
For multiple columns¶
>>> table = load_table('stats.txt', sep=',')
>>> raw = table.tolist(['Locus', 'Region'])
>>> print(raw)
[['NP_003077', 'Con'], ['NP_004893', 'Con'], ...
Iterating over table rows¶
>>> table = load_table('stats.txt', sep=',')
>>> for row in table:
... print(row['Locus'])
...
NP_003077
NP_004893
NP_005079
NP_005500
NP_055852
Table slicing¶
Using column names¶
>>> table = load_table('stats.txt', sep=',')
>>> print(table[:2, :'Region'])
=========
Locus
---------
NP_003077
NP_004893
---------
Using column indices¶
>>> table = load_table('stats.txt', sep=',')
>>> print(table[:2,: 1])
=========
Locus
---------
NP_003077
NP_004893
---------
SQL-like capabilities¶
Distinct values¶
>>> table = load_table('stats.txt', sep=',')
>>> assert table.distinct_values('Region') == set(['NonCon', 'Con'])
Counting¶
>>> table = load_table('stats.txt', sep=',')
>>> assert table.count("Region == 'NonCon' and Ratio > 1") == 1
Joining tables¶
SQL-like join operations requires tables have different title
attributes which are not None
. We do a standard inner join here for a restricted subset. We must specify the columns that will be used for the join. Here we just use Locus
but multiple columns can be used, and their names can be different between the tables. Note that the second table’s title becomes a part of the column names.
>>> rows = [['NP_004893', True], ['NP_005079', True],
... ['NP_005500', False], ['NP_055852', False]]
>>> region_type = make_table(header=['Locus', 'LargeCon'], rows=rows,
... title='RegionClass')
>>> stats_table = load_table('stats.txt', sep=',', title='Stats')
>>> new = stats_table.joined(region_type, columns_self='Locus')
>>> print(new)
============================================================
Locus Region Ratio RegionClass_LargeCon
------------------------------------------------------------
NP_004893 Con 121351.4264 True
NP_005079 Con 9516594.9789 True
NP_005500 NonCon 0.0000 False
NP_055852 NonCon 10933217.7090 False
------------------------------------------------------------
Exporting¶
Writing delimited formats¶
>>> table = load_table('stats.txt', sep=',')
>>> table.write('stats_tab.txt', sep='\t')
Writing latex format¶
It is also possible to specify column alignment, table caption and other arguments.
>>> table = load_table('stats.txt', sep=',')
>>> print(table.to_string(format='latex'))
\begin{table}[htp!]
\centering
\begin{tabular}{ r r r }
\hline
\bf{Locus} & \bf{Region} & \bf{Ratio} \\
\hline
\hline
NP_003077 & Con & 2.5386 \\
NP_004893 & Con & 121351.4264 \\
NP_005079 & Con & 9516594.9789 \\
NP_005500 & NonCon & 0.0000 \\
NP_055852 & NonCon & 10933217.7090 \\
\hline
\end{tabular}
\end{table}
Writing bedGraph format¶
This format allows display of annotation tracks on genome browsers.
>>> rows = [['1', 100, 101, 1.123], ['1', 101, 102, 1.123],
... ['1', 102, 103, 1.123], ['1', 103, 104, 1.123],
... ['1', 104, 105, 1.123], ['1', 105, 106, 1.123],
... ['1', 106, 107, 1.123], ['1', 107, 108, 1.123],
... ['1', 108, 109, 1], ['1', 109, 110, 1],
... ['1', 110, 111, 1], ['1', 111, 112, 1],
... ['1', 112, 113, 1], ['1', 113, 114, 1],
... ['1', 114, 115, 1], ['1', 115, 116, 1],
... ['1', 116, 117, 1], ['1', 117, 118, 1],
... ['1', 118, 119, 2], ['1', 119, 120, 2],
... ['1', 120, 121, 2], ['1', 150, 151, 2],
... ['1', 151, 152, 2], ['1', 152, 153, 2],
... ['1', 153, 154, 2], ['1', 154, 155, 2],
... ['1', 155, 156, 2], ['1', 156, 157, 2],
... ['1', 157, 158, 2], ['1', 158, 159, 2],
... ['1', 159, 160, 2], ['1', 160, 161, 2]]
...
>>> bgraph = make_table(header=['chrom', 'start', 'end', 'value'],
... rows=rows)
...
>>> print(bgraph.to_string(format='bedgraph', name='test track',
... description='test of bedgraph', color=(255,0,0)))
track type=bedGraph name="test track" description="test of bedgraph" color=255,0,0
1 100 108 1.12
1 108 118 1.00
1 118 161 2.00
The bedgraph formatter defaults to rounding values to 2 decimal places. You can adjust that precision using the ``digits`` argument.
>>> print(bgraph.to_string(format='bedgraph', name='test track',
... description='test of bedgraph', color=(255,0,0), digits=0))
track type=bedGraph name="test track" description="test of bedgraph" color=255,0,0
1 100 118 1.00
1 118 161 2.00