Tabular data ------------ .. authors, Gavin Huttley, Kristian Rother, Patrick Yannul .. doctest:: :hide: >>> # just saving some tabular data for subsequent data >>> from cogent3 import make_table, load_table >>> rows = (('NP_003077', 'Con', 2.5386013224378985), ... ('NP_004893', 'Con', 0.12135142635634111e+06), ... ('NP_005079', 'Con', 0.95165949788861326e+07), ... ('NP_005500', 'NonCon', 0.73827030202664901e-07), ... ('NP_055852', 'NonCon', 1.0933217708952725e+07)) >>> table = make_table(header=['Locus', 'Region', 'Ratio'], rows=rows) >>> table.write('stats.txt', sep=',') Loading delimited formats ^^^^^^^^^^^^^^^^^^^^^^^^^ We load a comma separated data file using the generic ``load_table`` function. .. doctest:: >>> from cogent3 import load_table >>> table = load_table('stats.txt', sep=',') >>> print(table) ==================================== Locus Region Ratio ------------------------------------ NP_003077 Con 2.5386 NP_004893 Con 121351.4264 NP_005079 Con 9516594.9789 NP_005500 NonCon 0.0000 NP_055852 NonCon 10933217.7090 ------------------------------------ Reading large files ^^^^^^^^^^^^^^^^^^^ For really large files the automated conversion used by the standard read mechanism can be quite slow. If the data within a column is consistently of one type, set the ``load_table`` argument ``static_column_types=True``. This causes the ``Table`` object to create a custom reader. .. doctest:: >>> table = load_table('stats.txt', static_column_types=True, sep=",") >>> print(table) ==================================== Locus Region Ratio ------------------------------------ NP_003077 Con 2.5386 NP_004893 Con 121351.4264 NP_005079 Con 9516594.9789 NP_005500 NonCon 0.0000 NP_055852 NonCon 10933217.7090 ------------------------------------ Formatting ^^^^^^^^^^ Changing displayed numerical precision """""""""""""""""""""""""""""""""""""" We change the ``Ratio`` column to using scientific notation. .. doctest:: >>> table.format_column('Ratio', '%.1e') >>> print(table) ============================== Locus Region Ratio ------------------------------ NP_003077 Con 2.5e+00 NP_004893 Con 1.2e+05 NP_005079 Con 9.5e+06 NP_005500 NonCon 7.4e-08 NP_055852 NonCon 1.1e+07 ------------------------------ Change digits or column spacing """"""""""""""""""""""""""""""" This can be done on table loading, .. doctest:: >>> table = load_table('stats.txt', sep=',', digits=1, space=2) >>> print(table) ============================= Locus Region Ratio ----------------------------- NP_003077 Con 2.5 NP_004893 Con 121351.4 NP_005079 Con 9516595.0 NP_005500 NonCon 0.0 NP_055852 NonCon 10933217.7 ----------------------------- or, for spacing at least, by modifying the attributes .. doctest:: >>> table.space = ' ' >>> print(table) ================================= Locus Region Ratio --------------------------------- NP_003077 Con 2.5 NP_004893 Con 121351.4 NP_005079 Con 9516595.0 NP_005500 NonCon 0.0 NP_055852 NonCon 10933217.7 --------------------------------- Changing column headings ^^^^^^^^^^^^^^^^^^^^^^^^ The table ``header`` is immutable. Changing column headings is done as follows. .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> print(table.header) ['Locus', 'Region', 'Ratio'] >>> table = table.with_new_header('Ratio', 'Stat') >>> print(table.header) ['Locus', 'Region', 'Stat'] Creating new columns from existing ones ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This can be used to take a single, or multiple columns and generate a new column of values. Here we'll take 2 columns and return True/False based on a condition. .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> table = table.with_new_column('LargeCon', ... lambda r_v: r_v[0] == 'Con' and r_v[1]>10.0, ... columns=['Region', 'Ratio']) >>> print(table) ================================================ Locus Region Ratio LargeCon ------------------------------------------------ NP_003077 Con 2.5386 False NP_004893 Con 121351.4264 True NP_005079 Con 9516594.9789 True NP_005500 NonCon 0.0000 False NP_055852 NonCon 10933217.7090 False ------------------------------------------------ Appending tables ^^^^^^^^^^^^^^^^ Can be done without specifying a new column. Here we simply use the same table data. .. doctest:: >>> table1 = load_table('stats.txt', sep=',') >>> table2 = load_table('stats.txt', sep=',') >>> table = table1.appended(None, table2) >>> print(table) ==================================== Locus Region Ratio ------------------------------------ NP_003077 Con 2.5386 NP_004893 Con 121351.4264 NP_005079 Con 9516594.9789 NP_005500 NonCon 0.0000 NP_055852 NonCon 10933217.7090 NP_003077 Con 2.5386 NP_004893 Con 121351.4264 NP_005079 Con 9516594.9789 NP_005500 NonCon 0.0000 NP_055852 NonCon 10933217.7090 ------------------------------------ or with a new column .. doctest:: >>> table1.title = 'Data1' >>> table2.title = 'Data2' >>> table = table1.appended('Data#', table2, title='') >>> print(table) ============================================= Data# Locus Region Ratio --------------------------------------------- Data1 NP_003077 Con 2.5386 Data1 NP_004893 Con 121351.4264 Data1 NP_005079 Con 9516594.9789 Data1 NP_005500 NonCon 0.0000 Data1 NP_055852 NonCon 10933217.7090 Data2 NP_003077 Con 2.5386 Data2 NP_004893 Con 121351.4264 Data2 NP_005079 Con 9516594.9789 Data2 NP_005500 NonCon 0.0000 Data2 NP_055852 NonCon 10933217.7090 --------------------------------------------- .. note:: We assigned an empty string to ``title``, otherwise the resulting table has the same ``title`` attribute as that of ``table1``. Summing a single column ^^^^^^^^^^^^^^^^^^^^^^^ .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> table.summed('Ratio') 20571166.652... Summing multiple columns or rows - strictly numerical data ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We define a strictly numerical table, .. doctest:: >>> all_numeric = make_table(header=['A', 'B', 'C'], rows=[range(3), ... range(3,6), range(6,9), range(9,12)]) >>> print(all_numeric) ============= A B C ------------- 0 1 2 3 4 5 6 7 8 9 10 11 ------------- and sum all columns (default condition) .. doctest:: >>> all_numeric.summed() [18, 22, 26] and all rows .. doctest:: >>> all_numeric.summed(col_sum=False) [3, 12, 21, 30] Summing multiple columns or rows with mixed non-numeric/numeric data ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We define a table with mixed data, like a distance matrix. .. doctest:: >>> mixed = make_table(header=['A', 'B', 'C'], rows=[['*',1,2], [3,'*', 5], ... [6,7,'*']]) >>> print(mixed) =========== A B C ----------- * 1 2 3 * 5 6 7 * ----------- and sum all columns (default condition), ignoring non-numerical data .. doctest:: >>> mixed.summed(strict=False) [9, 8, 7] and all rows .. doctest:: >>> mixed.summed(col_sum=False, strict=False) [3, 8, 13] Filtering table rows ^^^^^^^^^^^^^^^^^^^^ We can do this by providing a reference to an external function .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> sub_table = table.filtered(lambda x: x < 10.0, columns='Ratio') >>> print(sub_table) ============================= Locus Region Ratio ----------------------------- NP_003077 Con 2.5386 NP_005500 NonCon 0.0000 ----------------------------- or using valid python syntax within a string, which is executed .. doctest:: >>> sub_table = table.filtered("Ratio < 10.0") >>> print(sub_table) ============================= Locus Region Ratio ----------------------------- NP_003077 Con 2.5386 NP_005500 NonCon 0.0000 ----------------------------- You can also filter for values in multiple columns .. doctest:: >>> sub_table = table.filtered("Ratio < 10.0 and Region == 'NonCon'") >>> print(sub_table) ============================= Locus Region Ratio ----------------------------- NP_005500 NonCon 0.0000 ----------------------------- Filtering table columns ^^^^^^^^^^^^^^^^^^^^^^^ We select only columns that have a sum > 20 from the ``all_numeric`` table constructed above. .. doctest:: >>> big_numeric = all_numeric.filtered_by_column(lambda x: sum(x)>20) >>> print(big_numeric) ======== B C -------- 1 2 4 5 7 8 10 11 -------- Sorting ^^^^^^^ Standard sorting """""""""""""""" .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> print(table.sorted(columns='Ratio')) ==================================== Locus Region Ratio ------------------------------------ NP_005500 NonCon 0.0000 NP_003077 Con 2.5386 NP_004893 Con 121351.4264 NP_005079 Con 9516594.9789 NP_055852 NonCon 10933217.7090 ------------------------------------ Reverse sorting """"""""""""""" .. doctest:: >>> print(table.sorted(columns='Ratio', reverse='Ratio')) ==================================== Locus Region Ratio ------------------------------------ NP_055852 NonCon 10933217.7090 NP_005079 Con 9516594.9789 NP_004893 Con 121351.4264 NP_003077 Con 2.5386 NP_005500 NonCon 0.0000 ------------------------------------ Sorting involving multiple columns, one reversed """""""""""""""""""""""""""""""""""""""""""""""" .. doctest:: >>> print(table.sorted(columns=['Region', 'Ratio'], reverse='Ratio')) ==================================== Locus Region Ratio ------------------------------------ NP_005079 Con 9516594.9789 NP_004893 Con 121351.4264 NP_003077 Con 2.5386 NP_055852 NonCon 10933217.7090 NP_005500 NonCon 0.0000 ------------------------------------ Getting raw data ^^^^^^^^^^^^^^^^ For a single column """"""""""""""""""" .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> raw = table.tolist('Region') >>> print(raw) ['Con', 'Con', 'Con', 'NonCon', 'NonCon'] For multiple columns """""""""""""""""""" .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> raw = table.tolist(['Locus', 'Region']) >>> print(raw) [['NP_003077', 'Con'], ['NP_004893', 'Con'], ... Iterating over table rows ^^^^^^^^^^^^^^^^^^^^^^^^^ .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> for row in table: ... print(row['Locus']) ... NP_003077 NP_004893 NP_005079 NP_005500 NP_055852 Table slicing ^^^^^^^^^^^^^ Using column names """""""""""""""""" .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> print(table[:2, :'Region']) ========= Locus --------- NP_003077 NP_004893 --------- Using column indices """""""""""""""""""" .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> print(table[:2,: 1]) ========= Locus --------- NP_003077 NP_004893 --------- SQL-like capabilities ^^^^^^^^^^^^^^^^^^^^^ Distinct values """"""""""""""" .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> assert table.distinct_values('Region') == set(['NonCon', 'Con']) Counting """""""" .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> assert table.count("Region == 'NonCon' and Ratio > 1") == 1 Joining tables """""""""""""" SQL-like join operations requires tables have different ``title`` attributes which are not ``None``. We do a standard inner join here for a restricted subset. We must specify the columns that will be used for the join. Here we just use ``Locus`` but multiple columns can be used, and their names can be different between the tables. Note that the second table's title becomes a part of the column names. .. doctest:: >>> rows = [['NP_004893', True], ['NP_005079', True], ... ['NP_005500', False], ['NP_055852', False]] >>> region_type = make_table(header=['Locus', 'LargeCon'], rows=rows, ... title='RegionClass') >>> stats_table = load_table('stats.txt', sep=',', title='Stats') >>> new = stats_table.joined(region_type, columns_self='Locus') >>> print(new) ============================================================ Locus Region Ratio RegionClass_LargeCon ------------------------------------------------------------ NP_004893 Con 121351.4264 True NP_005079 Con 9516594.9789 True NP_005500 NonCon 0.0000 False NP_055852 NonCon 10933217.7090 False ------------------------------------------------------------ Exporting ^^^^^^^^^ Writing delimited formats """"""""""""""""""""""""" .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> table.write('stats_tab.txt', sep='\t') Writing latex format """""""""""""""""""" It is also possible to specify column alignment, table caption and other arguments. .. doctest:: >>> table = load_table('stats.txt', sep=',') >>> print(table.to_string(format='latex')) \begin{table}[htp!] \centering \begin{tabular}{ r r r } \hline \bf{Locus} & \bf{Region} & \bf{Ratio} \\ \hline \hline NP_003077 & Con & 2.5386 \\ NP_004893 & Con & 121351.4264 \\ NP_005079 & Con & 9516594.9789 \\ NP_005500 & NonCon & 0.0000 \\ NP_055852 & NonCon & 10933217.7090 \\ \hline \end{tabular} \end{table} .. we remove the table data .. doctest:: :hide: >>> import os >>> os.remove('stats.txt') >>> os.remove('stats_tab.txt') Writing bedGraph format """"""""""""""""""""""" This format allows display of annotation tracks on genome browsers. .. doctest:: :options: +NORMALIZE_WHITESPACE >>> rows = [['1', 100, 101, 1.123], ['1', 101, 102, 1.123], ... ['1', 102, 103, 1.123], ['1', 103, 104, 1.123], ... ['1', 104, 105, 1.123], ['1', 105, 106, 1.123], ... ['1', 106, 107, 1.123], ['1', 107, 108, 1.123], ... ['1', 108, 109, 1], ['1', 109, 110, 1], ... ['1', 110, 111, 1], ['1', 111, 112, 1], ... ['1', 112, 113, 1], ['1', 113, 114, 1], ... ['1', 114, 115, 1], ['1', 115, 116, 1], ... ['1', 116, 117, 1], ['1', 117, 118, 1], ... ['1', 118, 119, 2], ['1', 119, 120, 2], ... ['1', 120, 121, 2], ['1', 150, 151, 2], ... ['1', 151, 152, 2], ['1', 152, 153, 2], ... ['1', 153, 154, 2], ['1', 154, 155, 2], ... ['1', 155, 156, 2], ['1', 156, 157, 2], ... ['1', 157, 158, 2], ['1', 158, 159, 2], ... ['1', 159, 160, 2], ['1', 160, 161, 2]] ... >>> bgraph = make_table(header=['chrom', 'start', 'end', 'value'], ... rows=rows) ... >>> print(bgraph.to_string(format='bedgraph', name='test track', ... description='test of bedgraph', color=(255,0,0))) # doctest: +NORMALIZE_WHITESPACE track type=bedGraph name="test track" description="test of bedgraph" color=255,0,0 1 100 108 1.12 1 108 118 1.00 1 118 161 2.00 The bedgraph formatter defaults to rounding values to 2 decimal places. You can adjust that precision using the ``digits`` argument. .. doctest:: :options: +NORMALIZE_WHITESPACE >>> print(bgraph.to_string(format='bedgraph', name='test track', ... description='test of bedgraph', color=(255,0,0), digits=0)) # doctest: +NORMALIZE_WHITESPACE track type=bedGraph name="test track" description="test of bedgraph" color=255,0,0 1 100 118 1.00 1 118 161 2.00