    Willy Lulciuc
    @wslulciuc
    @frankcash Yep! I've looked over all issues and PRs and will be getting back to them throughout the day! (Also, thanks for the ping!)
    First, I wanted to backfill our CHANGELOG.md before merging new features, bug fixes, etc. (https://github.com/MarquezProject/marquez/blob/master/CHANGELOG.md)
    Frank Cash
    @frankcash
    Sounds good!
    Frank Cash
    @frankcash
    @wslulciuc thank you so much for the feedback so far, I truly appreciate it :). Could you prioritize review of this PR: MarquezProject/marquez-web#92? It would fix a super annoying bug
    Willy Lulciuc
    @wslulciuc
    @frankcash Sure! Let me give the PR a review now. Also, sure thing, thanks for your contributions!
    Willy Lulciuc
    @wslulciuc
    @frankcash Merged! I also merged your PR introducing a link to API docs: MarquezProject/marquez-web#93
    Let me know if you'd like to pick up any more UI issues :)
    Frank Cash
    @frankcash
    for sure! I'm not usually much of a front-end person, but I will keep my eyes open
    Frank Cash
    @frankcash
    @wslulciuc can you provision a new Docker image for marquez-web after reviewing MarquezProject/marquez-web#99 and MarquezProject/marquez-web#98?
    Willy Lulciuc
    @wslulciuc
    @frankcash Absolutely. Let's plan for a 0.5.0 release today. I should get around to reviewing your recent PRs later today (thanks btw!).
    Frank Cash
    @frankcash
    thanks Willy, I really appreciate it
    Willy Lulciuc
    @wslulciuc
    @frankcash I'm also planning to complete / merge MarquezProject/marquez#708 (configurable sources!) by the end of this week. It will introduce some breaking changes. Not sure if you will be affected.
    Willy Lulciuc
    @wslulciuc
    @frankcash I released marquez-web 0.5.1 that merges both of your tagging fixes. Thanks for the help!
    Frank Cash
    @frankcash
    awesome, thank you
    anuragabm
    @anuragabm
    I am new to Marquez... my primary requirement is to achieve data lineage from source to destination. My data lands from different sources like Cassandra, SQL Server, etc. into S3 and, after a lot of transformation, is moved into Snowflake. How can I achieve data lineage with the help of Marquez? I am using Airflow for orchestration purposes.
    Henry Saputra
    @hsaputra
    Hi, I am new to Marquez and wondering if there are any user or developer guides to learn more about the features and configuration options for Marquez? Thanks
    Frank Cash
    @frankcash
    @hsaputra I've only been using Marquez for about 2.5 weeks, but I've found that diving into the source code has been an excellent way to learn about it when I hit blockers. I originally started with this talk https://www.youtube.com/watch?v=BIVUXruv5io and the API documentation https://marquezproject.github.io/marquez/openapi.html after following the quick start guide.
    @anuragabm that answer depends on how you wish to interact with Marquez. You could use the marquez-airflow plugin https://github.com/MarquezProject/marquez-airflow (please note that this plugin is semi-opinionated), which would require you to initialize your datasets using a different tool, OR you could manage them with https://github.com/MarquezProject/marquez-python by embedding it into your DAG and having an ad hoc job to manage the datasets.
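    For example, you could pre-register a dataset straight through the REST API (a rough sketch -- the exact payload is in the OpenAPI doc above, and the namespace / source names here are just placeholders):

    curl -X PUT http://localhost:5000/api/v1/namespaces/my-namespace/datasets/room_bookings \
      -H 'Content-Type: application/json' \
      -d '{
            "type": "DB_TABLE",
            "physicalName": "public.room_bookings",
            "sourceName": "my_db_source",
            "description": "All room bookings."
          }'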
    Henry Saputra
    @hsaputra
    Thanks @frankcash. I am interested in learning more about the Data Quality frameworks connection (saw it in https://www.slideshare.net/WillyLulciuc/marquez-an-open-source-metadata-service-for-ml-platforms) for Marquez. I could not find a reference to Griffin or Great Expectations in the source code.
    ponizvezdochka
    @ponizvezdochka
    Hi! I'm new to Marquez and wondering if there is a way to add datasets and edit dataset metadata (tags, fields) through the UI? Thank you!
    Siri-balupalli
    @Siri-balupalli
    Team... using Docker Compose I built Marquez... all containers were up, but I'm not able to see the seed data... help needed!
    Willy Lulciuc
    @wslulciuc

    @ashwinsingh Welcome! Great to hear you are excited about Marquez :) And thanks for helping out (above) by providing steps to prevent the UI from crashing. I've reported your scenario as a bug: MarquezProject/marquez-web#103

    Also I am wondering if there is a way to increase the size of the lineage chart without having to zoom in?

    Yeah, that's a very reasonable feature for the lineage graph (MarquezProject/marquez-web#102).

    Willy Lulciuc
    @wslulciuc

    Trying to show lineage between RDS-to-RDS jobs with Airflow, but the example just shows input URNs as S3. Is someone able to show how to format the Airflow DAG arguments for Marquez? Just what are the Marquez options to show a job derives from a data source?

    The usage docs in the readme for marquez-airflow are a bit outdated, but we are planning to improve both the docs and the implementation within the coming weeks. That is, having to specify the inputs / outputs for your DAG isn't ideal (and the URN format is an early approach we've moved away from). The lib does have a SQL parser that we experimented with at WeWork; it worked well for inspecting the SQL executed by a task (= job). At a high level, the tables within the SQL (FROM, JOINs, etc.) become inputs / outputs for the job and are handled by marquez-airflow, including collecting (and sending to Marquez) the run state, code location in GitHub, etc.
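    For example (purely illustrative -- the table names here are made up), for a task executing SQL like:

    INSERT INTO analytics.room_bookings_7_days   -- parsed as a job output
    SELECT b.booking_date, COUNT(*)
    FROM public.room_bookings AS b               -- parsed as a job input
    JOIN public.rooms AS r ON r.id = b.room_id   -- parsed as a job input
    GROUP BY b.booking_date;

    the parser would record room_bookings and rooms as inputs and room_bookings_7_days as the output for that job.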

    I am able to successfully create the task if I go through the API before triggering the DAG, but that's kind of lame

    Yeah, very lame. The goal is to have marquez-airflow collect all of this metadata for you. We'll soon have S3 as a supported dataset source; I'm currently working on closing the epic MarquezProject/marquez#708. That said, you'd still need to pre-register the source (i.e. RDS) before you could begin cataloging the tables within that instance using the source API.
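    Something along these lines (a rough sketch against the source API -- check the OpenAPI doc for the exact payload; the source name and connection URL here are made up):

    curl -X PUT http://localhost:5000/api/v1/sources/my_rds_instance \
      -H 'Content-Type: application/json' \
      -d '{
            "type": "POSTGRESQL",
            "connectionUrl": "jdbc:postgresql://my-rds-host:5432/mydb"
          }'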

    Willy Lulciuc
    @wslulciuc

    I am new to Marquez... my primary requirement is to achieve data lineage from source to destination. My data lands from different sources like Cassandra, SQL Server, etc. into S3 and, after a lot of transformation, is moved into Snowflake. How can I achieve data lineage with the help of Marquez? I am using Airflow for orchestration purposes.

    @anuragabm :wave: Welcome! Yes, you absolutely can! When registering job metadata with Marquez, the data model will ensure inputs / outputs are correctly associated with a job. You can check out our design docs here: https://marquezproject.github.io/marquez. We are also planning (as mentioned above) to improve both the docs and the implementation of marquez-airflow within the coming weeks, which will automatically collect lineage metadata for DAGs in Airflow. Support for easily configuring sources like Cassandra is currently being worked on in MarquezProject/marquez#708
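    For instance, a job registered along these lines (a sketch mirroring the quickstart; the namespace and dataset names are placeholders) has its lineage edges derived from the inputs / outputs lists:

    curl -X PUT http://localhost:5000/api/v1/namespaces/my-namespace/jobs/room_bookings_7_days_etl \
      -H 'Content-Type: application/json' \
      -d '{
            "type": "BATCH",
            "inputs": ["my-namespace.room_bookings"],
            "outputs": ["my-namespace.room_bookings_7_days"]
          }'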

    I'm also happy to answer any specific questions about data lineage, our design, the current state of the project, etc :)
    Willy Lulciuc
    @wslulciuc

    @hsaputra :wave: Welcome to the channel!

    Hi, I am new to Marquez and wondering if there are any user or developer guides to learn more about the features and configuration options for Marquez? Thanks

    We have our general design doc (https://marquezproject.github.io/marquez). You can also customize Marquez in the config.yml file passed to the HTTP server (see https://github.com/MarquezProject/marquez/blob/master/config.example.yml#L69). We also have an API doc (https://marquezproject.github.io/marquez/openapi.html). Let me know if these docs are helpful (or if we can improve them). Thanks!
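    For a quick orientation, config.yml follows Dropwizard's layout, roughly like this (a trimmed sketch -- see config.example.yml for the real options and defaults):

    server:
      applicationConnectors:
        - type: http
          port: 8080
      adminConnectors:
        - type: http
          port: 8081
    db:
      driverClass: org.postgresql.Driver
      url: jdbc:postgresql://localhost:5432/marquez
      user: marquez
      password: marquez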

    Frank Cash
    @frankcash
    when looking at a dataset, would it be cool if you could see the source and the JDBC URL for the source?
    anuragabm
    @anuragabm
    @wslulciuc thank you Willy for your response!! My main requirement is data lineage... I was referring to marquez-airflow (https://github.com/MarquezProject/marquez-airflow), but in the example the default_args given are:

    'marquez_location': 'github://data-dags/dag_location/',
    'marquez_input_urns': ["s3://some_data", "s3://more_data"],
    'marquez_output_urns': ["s3://output_data"],

    ... is this more related to data collection? I am not able to conclude on data lineage. Is there some doc or example available for data lineage where data is read from S3, transformed with the help of Spark, and stored into Snowflake? And, in Snowflake itself, new table creation and loading data into a table from multiple tables?
    Frank Cash
    @frankcash
    is there an ideal endpoint in Marquez to use as a healthcheck?
    Willy Lulciuc
    @wslulciuc

    is there an ideal endpoint in Marquez to use as a healthcheck?

    @frankcash Yep! You can use http://localhost:5001/healthcheck:

    {
      "deadlocks" : {
        "healthy" : true,
        "duration" : 0,
        "timestamp" : "2020-05-15T07:25:45.258Z"
      },
      "postgresql" : {
        "healthy" : true,
        "duration" : 9,
        "timestamp" : "2020-05-15T07:25:45.257Z"
      }
    }
    200 OK

    Note that you have to use the admin port (=5001)
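    e.g., if you're wiring it into a container healthcheck (assuming the docker-compose defaults):

    curl --fail http://localhost:5001/healthcheck || exit 1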

    Willy Lulciuc
    @wslulciuc

    when looking at a dataset, would it be cool if you could see the source and the JDBC URL for the source?

    Absolutely! We had planned to make source metadata viewable via the UI. Mind opening an issue against marquez-web to start / capture the discussion? Also, let me know if this is a UI feature you'd like to contribute!

    Willy Lulciuc
    @wslulciuc
    @anuragabm Great! A job's inputs / outputs are stored in the metadata DB for Marquez. This allows us to build and display the lineage graph in marquez-web. The marquez-airflow lib was in early development at WeWork and was used to render the lineage graph for Airflow instances in production. The integration was also used to drive the implementation of Marquez. That said, the lib needs a few iterations before it can be used more generally. We'll be working on adding docs, updating the API, etc. for marquez-airflow closer to the end of this month. We'll also be working on an integration for Spark to capture input / output metadata from Spark jobs (shortly after we feel the Airflow integration is in a good state). I'll shortly be wrapping up MarquezProject/marquez#708, allowing sources like S3, Snowflake, etc. to be configured, then switching over to a design proposal for the marquez-airflow lib. Would you be interested in providing feedback on the design?
    Willy Lulciuc
    @wslulciuc

    Is there some doc or example available for data lineage where data is read from S3, transformed with the help of Spark, and stored into Snowflake? And, in Snowflake itself, new table creation and loading data into a table from multiple tables?

    @anuragabm This is a perfect use-case for Marquez and steps that can be tracked with the current API and data model. We've been working on example docs and will make those available as soon as we release a stable version of marquez-airflow.

    Team... using Docker Compose I built Marquez... all containers were up, but I'm not able to see the seed data... help needed!

    @Siri-balupalli Awesome! Do you have any logs to share or output from the terminal? And are you able to browse / view the Marquez UI?

    Frank Cash
    @frankcash
    @wslulciuc thanks for pointing that out. Should I add that to the OAuth docs? It seems kind of hidden
    (I am finally deploying to staging yeet)
    Frank Cash
    @frankcash
    also, @wslulciuc, is there a good healthcheck endpoint for the React app?
    Siri-balupalli
    @Siri-balupalli
    @wslulciuc I have created a DB on localhost with the existing schemas/tables. I wanted to add a new table and put all those details in the initial schema.sql, but the changes are not getting applied to the DB (I mean the new table is not getting added) after ./gradlew run. Do I need to add the table/columns manually to the DB?
    Frank Cash
    @frankcash
    @wslulciuc when you get a chance, can you review MarquezProject/marquez-python#81?
    Manel
    @mermi
    Hello! At my company we are evaluating several ETL tools, as we want to migrate from Matillion. I am looking at Marquez; as an old Airflow user, I liked it when I heard about it at Data Crunch. Our DWH tool so far is Snowflake. I was wondering if it is possible to have the metadata of the Marquez workflow stored in Snowflake? I would also love to hear some opinions and thoughts about Marquez, because I cannot find many articles around the net. Thank you :sunflower:
    Siri-balupalli
    @Siri-balupalli
    @wslulciuc and @everyone... planning to implement OAuth2 on this project. I'd appreciate suggestions or input if anyone has already implemented it.
    Manel
    @mermi
    Is it possible to have concurrent user logins when you use the Marquez UI?
    Frank Cash
    @frankcash
    I have an ETL that creates a staging table as a copy of datasets from a distinct source and then upserts it into a final table. How are people modeling this? I feel like skipping the middle layer is the only way to represent it in Marquez without making two jobs, but then my colleagues miss that context... and with two Marquez jobs I'd have two jobs for one Airflow DAG.
    Manel
    @mermi
    @everyone the link for the ADD JOB TO NAMESPACE step in this quickstart is not working: https://marquezproject.github.io/marquez/quickstart.html. Is there a new version of the job file?
    Siri-balupalli
    @Siri-balupalli
    @frankcash I would like to know if you have implemented OAuth2 on Marquez
    1 reply
    sunzhusz
    @sunzhusz
    On the marquez-web master branch, I ran docker-compose up, but the web UI errors with "Something went wrong while fetching initial data."
    Does anyone know why? Thank you.
    (Log screenshot: NmJiBd.png)
    Vamsee Lakamsani
    @lakamsani

    Hi, checking out Marquez for the first time. Running it on my MacBook (Catalina / 10.15.5) via Gradle / main branch (not Docker), with PostgreSQL 12.3_4 and openjdk version "1.8.0_252".

    Steps 1-3 of the quickstart (https://marquezproject.github.io/marquez/quickstart.html) ran fine. However, step 4 is returning a 400:

    curl -X PUT http://localhost:8080/api/v1/namespaces/wedata/jobs/room_bookings_7_days \
      -H 'Content-Type: application/json' \
      -d '{
            "type": "BATCH",
            "inputs": ["wedata.room_bookings"],
            "outputs": [],
            "location": "https://github.com/wework/jobs/commit/124f6089ad4c5fcbb1d7b33cbb5d3a9521c5d32c",
            "description": "Weekly email of room bookings occupancy patterns."
          }'
    {"code":400,"message":"Unable to process JSON"}

    Guessing it doesn't really check the location?
