These are chat archives for bigdatafoundation/qngene

18th Feb 2016
David Lauzon
@davidonlaptop
Feb 18 2016 16:04
hey, I just saw that Spark can automatically infer the schema of the data in a CSV file
here is the code from my Zeppelin notebook:
first, I load the dependency for the CSV module
%dep
z.reset()
z.load("com.databricks:spark-csv_2.10:1.3.0”)

DepInterpreter(%dep) deprecated. Remove dependencies and repositories through GUI interpreter menu instead.
DepInterpreter(%dep) deprecated. Load dependency through GUI interpreter menu instead.
res0: org.apache.zeppelin.dep.Dependency = org.apache.zeppelin.dep.Dependency@5be2dac5
then I restarted the Spark interpreter (Interpreter tab > Spark > restart)
and re-ran the first block above at the top of the notebook
then I added a Scala code block:
%spark
val csv = sqlContext.read.format("com.databricks.spark.csv")
    .option("header","true").option("inferSchema","true").option("delimiter","\t")
    .load("file:///Users/david/workspace/genomic/optithera-scripts/doc/data-dictionnary/buffer.txt")

// Show data in a SQL like table (the fields are shortened to fit in one line)
csv.show

// Print column names
println( "\tColumn names:\t" + csv.columns.deep.mkString("\t") )

// Automatic detection of the schema
print( "\tData types:\t")
for (item <- csv.dtypes)  print(item + "\t")
println("")

// Or use this shortcut if you just want to print it
println("\tSchema:")
csv.printSchema

// Show descriptive statistics
println("\tDescriptive statistics:")
csv.describe( "idphenotypes", "nom", "nom_view", "description", "type", "querry", "covariable_choosing_table", "characteristic_table" ).show

csv.registerTempTable("mytable")
sqlContext.cacheTable("mytable")    // Fixes https://github.com/databricks/spark-csv/issues/83
Output:
csv: org.apache.spark.sql.DataFrame = [idphenotypes: int, nom: string, nom_view: string, description: string, type: string, querry: string, covariable_choosing_table: string, characteristic_table: string]
+------------+---------------+--------------------+--------------------+------------+--------------------+-------------------------+--------------------+
|idphenotypes|            nom|            nom_view|         description|        type|              querry|covariable_choosing_table|characteristic_table|
+------------+---------------+--------------------+--------------------+------------+--------------------+-------------------------+--------------------+
|           1|All cause death|view_med_all_caus...|All individuals t...|    binomial|SELECT p.id, p.sa...|     @10.54.203.119:/s...|@10.54.203.119:/s...|
|           2|            age|view_pph_age_floa...|Age of the person...|quantitative|SELECT p.id, p.sa...|                         |                    |
|           3|            sex|view_pph_sex_base...|sex of the person...| qualitative| SELECT person.id...|                         |                    |
|           4|            pc1| view_pc1_eigenstrat|genetic principal...|quantitative|SELECT pca.id, pc...|                         |                    |
|           5|            pc2| view_pc2_eigenstrat|genetic principal...|quantitative|SELECT pca.id, pc...|                         |                    |
+------------+---------------+--------------------+--------------------+------------+--------------------+-------------------------+--------------------+
    Column names:    idphenotypes    nom    nom_view    description    type    querry    covariable_choosing_table    characteristic_table
    Data types:    (idphenotypes,IntegerType)    (nom,StringType)    (nom_view,StringType)    (description,StringType)    (type,StringType)    (querry,StringType)    (covariable_choosing_table,StringType)    (characteristic_table,StringType)    
    Schema:
root
 |-- idphenotypes: integer (nullable = true)
 |-- nom: string (nullable = true)
 |-- nom_view: string (nullable = true)
 |-- description: string (nullable = true)
 |-- type: string (nullable = true)
 |-- querry: string (nullable = true)
 |-- covariable_choosing_table: string (nullable = true)
 |-- characteristic_table: string (nullable = true)
    Descriptive statistics:
+-------+------------------+---------------+--------------------+--------------------+------------+--------------------+-------------------------+--------------------+
|summary|      idphenotypes|            nom|            nom_view|         description|        type|              querry|covariable_choosing_table|characteristic_table|
+-------+------------------+---------------+--------------------+--------------------+------------+--------------------+-------------------------+--------------------+
|  count|                 5|              5|                   5|                   5|           5|                   5|                        5|                   5|
|   mean|               3.0|           null|                null|                null|        null|                null|                     null|                null|
| stddev|1.4142135623730951|           null|                null|                null|        null|                null|                     null|                null|
|    min|                 1|All cause death|view_med_all_caus...|Age of the person...|    binomial| SELECT person.id...|                         |                    |
|    max|                 5|            sex|view_pph_sex_base...|sex of the person...|quantitative|SELECT pca.id, pc...|     @10.54.203.119:/s...|@10.54.203.119:/s...|
+-------+------------------+---------------+--------------------+--------------------+------------+--------------------+-------------------------+--------------------+
David Lauzon
@davidonlaptop
Feb 18 2016 16:09
this block takes a CSV file (TAB separator) as input and demonstrates basic usage of a Spark DataFrame
it should work with any CSV file
except for the line csv.describe( "idphenotypes", "nom", "nom_view", "description", "type", "querry", "covariable_choosing_table", "characteristic_table" ).show, where I hardcoded the field names
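to avoid hardcoding the field names, one option (a sketch, assuming the same Spark 1.x DataFrame API used above) is to pass the inferred column names programmatically:
%spark
// Hypothetical alternative: reuse the inferred column names instead of listing them by hand
csv.describe(csv.columns: _*).show()
// Or call describe() with no arguments to get statistics for the numeric columns only
csv.describe().show()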
David Lauzon
@davidonlaptop
Feb 18 2016 16:14
the important bits:
loading the file:
val csv = sqlContext.read.format("com.databricks.spark.csv")
    .option("header","true").option("inferSchema","true").option("delimiter","\t")
    .load("file:///Users/david/workspace/genomic/optithera-scripts/doc/data-dictionnary/buffer.txt”)
and to display only the detected schema:
csv.printSchema
the last 2 lines only serve to create a table that can then be used in a SQL block
csv.registerTempTable("mytable")
sqlContext.cacheTable("mytable")
then in the SQL block I write:
%sql
SELECT * FROM mytable
and Zeppelin automatically displays a table that can be switched to a chart with a single click
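the same query can also be run from a Scala block instead of a %sql block (a sketch, assuming the temp table registered above); sqlContext.sql returns a regular DataFrame:
%spark
// Hypothetical equivalent of the %sql paragraph above, executed from Scala
val result = sqlContext.sql("SELECT * FROM mytable")
result.show()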
David Lauzon
@davidonlaptop
Feb 18 2016 16:20
I hope this will be useful to help you get started with Scala