brimdata/zed-sample-data

Sample Data

To help you get started quickly with zq, this repository contains small sample sets of Zeek data. There are five different log formats available, all representing events based on the same network traffic:

Directory           Format
zeek-default/       Zeek default output format
zeek-ndjson/        Newline-delimited JSON (NDJSON), as output by the Zeek package for JSON Streaming Logs
zng/                Binary ZNG, output with zq's default LZ4-compressed format
zng-uncompressed/   Binary ZNG, output with zq's option -zng.compress=false to disable compression
zson/               ZSON, a Zed text output format that has the look and feel of JSON

This sample data is used frequently for a simple Zed performance test and to check for unexpected changes in the Zed output formats.

Downloading

Because prior changes to the ZNG and ZSON output formats have added some bulk to the revision history, you'll typically want to save time by just downloading the latest revision:

# git clone --depth=1 https://github.com/brimdata/zed-sample-data.git

Origin/License

This sample data set was generated from a subset of the packet capture archives (formerly at https://archive.wrccdc.org/pcaps, though the site has been down of late) that are distributed by the WRCCDC.

This sample data is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, as it is built upon the WRCCDC PCAP data that is distributed under the same license.

Acknowledgement

We would like to express our thanks to the WRCCDC for generously making their packet capture archives available to the public and for commercial use. The terabytes of "real world" data have been invaluable to us in testing the foundations of zq at scale.

Creation

The data set was made from several PCAP files in the 2018 set. Zeek v6.2.0 was used in its default configuration, with the only change being the addition and enabling of the JSON Streaming Logs package. The packet captures were then processed with the following commands:

# mergecap -w wrccdc.pcap wrccdc.2018-03-24.10*.pcap
# zeek -r wrccdc.pcap local "JSONStreaming::enable_log_rotation=F"

This produced the logs in Zeek default and NDJSON formats. As ZNG and ZSON are not yet output directly by Zeek, the logs in those formats were created by sending each Zeek default log through zq, e.g.:

# mkdir -p zng && \
for file in zeek-default/*
do
  zq -f zng "$file" \
      | gzip -n > zng/"$(basename "$file" | sed 's/\.log\.gz//')".zng.gz
done

# mkdir -p zng-uncompressed && \
for file in zeek-default/*
do
  zq -f zng -zng.compress=false "$file" \
      | gzip -n > zng-uncompressed/"$(basename "$file" | sed 's/\.log\.gz//')".zng.gz
done

# mkdir -p zson && \
for file in zeek-default/*
do
  zq -f zson "$file" \
      | gzip -n > zson/"$(basename "$file" | sed 's/\.log\.gz//')".zson.gz
done
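
The three loops above differ only in the output directory, the format passed to -f, and any extra zq flags, so they could be folded into one helper. The following is a minimal sketch rather than a script shipped in this repository: the function name convert_logs and its argument order are hypothetical, and it assumes every input file in zeek-default/ ends in .log.gz.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository).
# Usage: convert_logs OUTDIR FORMAT [EXTRA_ZQ_ARGS...]
# Converts every zeek-default/*.log.gz file with zq and gzips the result
# into OUTDIR/<name>.<FORMAT>.gz, mirroring the loops shown above.
convert_logs() {
  outdir=$1
  format=$2
  shift 2                       # anything left in "$@" is passed to zq
  mkdir -p "$outdir" || return 1
  for file in zeek-default/*.log.gz
  do
    name=$(basename "$file" .log.gz)
    zq -f "$format" "$@" "$file" \
        | gzip -n > "$outdir/$name.$format.gz"
  done
}
```

With that helper defined, the three conversions would become `convert_logs zng zng`, `convert_logs zng-uncompressed zng -zng.compress=false`, and `convert_logs zson zson`.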

Testing

Since the sample ZNG and ZSON logs are generated by zq, regenerating these outputs is a useful zq test. A script is provided that, assuming zq is in your $PATH, regenerates the hash for each ZNG and ZSON log and compares it to the last known "good" hash stored in the md5sums/ directory.

Example output showing that a format change has been flagged:

# scripts/check_md5sums.sh zng
capture_loss:62949d22a0a557342d28ee5ee4b64d50
...
x509:10333d3d004c718b04cbedb8ee195cca

diff'ing current "zq -f zng" output hashes vs. committed hashes:
7c7
< ftp:c84824c8114df4db745399ff875b0d92
---
> ftp:2d8d90df3c4b84eb9e281a3f10767aa5

  ======> diffs detected! Check for a zq bug or intentional zng format change.
          Current hashes are in /var/folders/yn/jbkxxkpd4vg142pc3_bd_krc0000gn/T/tmp.9X7Gab9I
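
As a rough illustration of the kind of comparison shown above, the sketch below hashes fresh zq output for each Zeek default log and diffs the result against a file of known-good hashes. It is not the repository's scripts/check_md5sums.sh. The function names, the assumption that the expected hashes live in a single file of name:md5 lines in the same order, and the use of md5sum (GNU coreutils; macOS ships md5 instead) are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch (not the repository's check_md5sums.sh).

# Print one "name:md5" line per log in zeek-default/, hashing the output
# of "zq -f FORMAT" for that log.
# Usage: current_hashes FORMAT      (e.g. zng or zson)
current_hashes() {
  format=$1
  for file in zeek-default/*.log.gz
  do
    name=$(basename "$file" .log.gz)
    sum=$(zq -f "$format" "$file" | md5sum | cut -d' ' -f1)
    printf '%s:%s\n' "$name" "$sum"
  done
}

# Diff the current hashes against a file of committed "name:md5" lines.
# The single-file layout of EXPECTED_FILE is an assumption.
# Usage: check_hashes FORMAT EXPECTED_FILE
check_hashes() {
  format=$1
  expected=$2
  if current_hashes "$format" | diff - "$expected"
  then
    echo "no diffs detected"
  else
    echo "diffs detected: check for a zq bug or intentional $format format change"
    return 1
  fi
}
```

In the diff output of this sketch, lines starting with < are the freshly computed hashes and lines starting with > are the committed ones.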