🔌 Generator API Reference¶
Last Updated: 2026-04-15 | Version: 2.0 | Status: ✅ Final | Maintainer: Documentation Team
TL;DR -- This document covers all 19 data generators in the data_generation/generators/ package. Every generator inherits from BaseGenerator, which provides reproducible seeding, output serialization (DataFrame, Parquet, JSON), batch iteration, and PII masking helpers. Generators span four domains: Casino/Gaming (6), Federal Agency (7), Analytics (3), and Streaming (3).
📑 Table of Contents¶
- 🔧 BaseGenerator Interface
- 🎰 Casino Generators
  - SlotMachineGenerator
  - PlayerGenerator
  - FinancialGenerator
  - ComplianceGenerator
  - SecurityGenerator
  - TableGamesGenerator
- 🏛️ Federal Generators
  - USDAGenerator
  - SBAGenerator
  - NOAAGenerator
  - EPAGenerator
  - DOIGenerator
  - TribalHealthcareGenerator
  - DOTFAAGenerator
- 📹 Analytics Generators
  - VideoAnalyticsGenerator
  - PeopleMovementGenerator
  - GeolocationGenerator
- ⚡ Streaming Generators
  - EventHubProducer
  - MultiSourceSimulator
  - IoTDeviceSimulator
- 🧩 Extension Guide
- 📋 Return Value Schemas
# BaseGenerator Interface¶
Module: data_generation/generators/base_generator.py Type: Abstract base class (ABC)
All generators inherit from BaseGenerator. It provides seed management, output serialization, batch iteration, and helper utilities.
Constructor¶
BaseGenerator(
seed: int | None = None,
locale: str = "en_US",
start_date: datetime | None = None,
end_date: datetime | None = None,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `seed` | int \| None | 42 | Non-negative integer for reproducibility |
| `locale` | str | "en_US" | Faker locale for generated text |
| `start_date` | datetime \| None | now - 30 days | Start of the date range for generated data |
| `end_date` | datetime \| None | now | End of the date range for generated data |
Raises: ValueError if seed is negative or start_date > end_date.
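The validation rules above can be sketched as a standalone function. This is an illustration only, not the `BaseGenerator` source: `validate_generator_args` is a hypothetical name, and the fallbacks for omitted dates simply follow the defaults listed in the table.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def validate_generator_args(
    seed: int | None = None,
    start_date: datetime | None = None,
    end_date: datetime | None = None,
) -> tuple[int | None, datetime, datetime]:
    """Hypothetical stand-in for the documented constructor checks."""
    if seed is not None and seed < 0:
        raise ValueError("seed must be a non-negative integer")
    # Documented defaults: end_date -> now, start_date -> now - 30 days.
    now = datetime.now()
    end = end_date if end_date is not None else now
    start = start_date if start_date is not None else now - timedelta(days=30)
    if start > end:
        raise ValueError("start_date must not be after end_date")
    return seed, start, end
```

Passing `seed=-1`, or a `start_date` later than `end_date`, raises `ValueError` in this sketch, matching the documented behavior.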
Abstract Method¶
@abstractmethod
def generate_record(self) -> dict[str, Any]:
    """Generate a single record. Must be implemented by subclasses."""
Core Methods¶
generate(num_records, show_progress=True) -> pd.DataFrame¶
Generate multiple records into a DataFrame.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_records` | int | -- | Number of records (must be > 0) |
| `show_progress` | bool | True | Show tqdm progress bar |
Returns: pd.DataFrame containing generated records. Raises: ValueError if num_records is not a positive integer.
generate_batches(num_records, batch_size=10000, show_progress=True) -> Iterator[pd.DataFrame]¶
Memory-efficient batch generator.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_records` | int | -- | Total record count (must be > 0) |
| `batch_size` | int | 10000 | Records per batch (must be > 0) |
| `show_progress` | bool | True | Print batch progress messages |
Yields: pd.DataFrame batches. Raises: ValueError if either argument is not a positive integer.
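To see how a record count splits into batches, the following sketch computes only the batch sizes. It assumes the final batch carries the remainder, which this reference does not state explicitly, and `batch_sizes` is a hypothetical helper rather than part of the package.

```python
from collections.abc import Iterator


def batch_sizes(num_records: int, batch_size: int = 10_000) -> Iterator[int]:
    """Yield the size of each batch (hypothetical helper, not package code)."""
    for name, value in (("num_records", num_records), ("batch_size", batch_size)):
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer")
    remaining = num_records
    while remaining > 0:
        # Assumption: full batches first, then one smaller trailing batch.
        size = min(batch_size, remaining)
        yield size
        remaining -= size
```

Under that assumption, `generate_batches(25_000)` would yield three DataFrames, and only one batch needs to be held in memory at a time.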
to_parquet(df, output_path, partition_cols=None) -> None¶
Save a DataFrame to Parquet format.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `df` | pd.DataFrame | -- | DataFrame to save |
| `output_path` | str \| Path | -- | Output file or directory path |
| `partition_cols` | list[str] \| None | None | Columns to partition by |
Raises: OSError if the output path is not writable.
to_json(df, output_path, orient="records", lines=True) -> None¶
Save a DataFrame to JSON format.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `df` | pd.DataFrame | -- | DataFrame to save |
| `output_path` | str \| Path | -- | Output file path |
| `orient` | str | "records" | JSON orientation |
| `lines` | bool | True | Write as JSON lines |
Raises: OSError if the output path is not writable.
Properties¶
@property
def schema(self) -> dict[str, str]:
    """Return the schema definition for this generator."""
Helper Methods¶
| Method | Signature | Description |
|---|---|---|
| `random_datetime` | (start=None, end=None) -> datetime | Random datetime within configured or specified range |
| `generate_uuid` | () -> str | Generate a UUID v4 string |
| `hash_value` | (value: str, salt: str = "") -> str | SHA-256 hash of a value |
| `mask_ssn` | (ssn: str) -> str | Mask SSN to XXX-XX-1234 format |
| `mask_card_number` | (card_number: str) -> str | Mask card to ****-****-****-1234 |
| `weighted_choice` | (choices, weights) -> Any | Weighted random selection (weights must sum to ~1.0) |
| `add_metadata_columns` | (record: dict) -> dict | Append _ingested_at, _source, _batch_id |
Note: Every `generate_record()` call appends three metadata columns via `add_metadata_columns`: `_ingested_at` (ISO timestamp), `_source` (class name), and `_batch_id` (8-char UUID prefix).
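The PII helpers in the table can be approximated with the standard library. The functions below are a sketch of the documented output formats, not the package implementation. How `hash_value` combines the salt with the value is not specified here, so prepending it is an assumption.

```python
import hashlib


def hash_value(value: str, salt: str = "") -> str:
    # Assumption: the salt is prepended before hashing.
    return hashlib.sha256(f"{salt}{value}".encode("utf-8")).hexdigest()


def mask_ssn(ssn: str) -> str:
    # Keep only the last four digits, in the documented XXX-XX-1234 shape.
    digits = [c for c in ssn if c.isdigit()]
    return "XXX-XX-" + "".join(digits[-4:])


def mask_card_number(card_number: str) -> str:
    # Keep only the last four digits, in the documented ****-****-****-1234 shape.
    digits = [c for c in card_number if c.isdigit()]
    return "****-****-****-" + "".join(digits[-4:])
```

Hashing gives a stable join key without exposing the raw value, while masking keeps just enough of the original for a human to recognize a record.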
# Casino Generators¶
Six generators covering the core casino/gaming domain: slot telemetry, player profiles, cage financials, compliance filings, security events, and table game transactions.
SlotMachineGenerator¶
Module: data_generation/generators/slot_machine_generator.py Purpose: Generates SAS-protocol slot machine telemetry events.
Constructor¶
SlotMachineGenerator(
num_machines: int = 500,
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_machines` | int | 500 | Number of slot machines to simulate |
Event Types (enum values)¶
GAME_PLAY, JACKPOT, METER_UPDATE, DOOR_OPEN, DOOR_CLOSE, BILL_IN, TICKET_OUT, TILT, POWER_OFF, POWER_ON
Weights: GAME_PLAY dominates at 70%, JACKPOT at 2%, METER_UPDATE at 10%.
Record Schema Fields¶
event_id, machine_id, asset_number, location_id, zone, event_type, event_timestamp, denomination, coin_in, coin_out, jackpot_amount, games_played, theoretical_hold, actual_hold, player_id, session_id, machine_type, manufacturer, game_theme, error_code, error_message
Valid Enums¶
- MACHINE_TYPES: `Video Slots`, `Mechanical Reels`, `Video Poker`, `Progressive`
- DENOMINATIONS: `0.01` through `100.00` (10 values)
- MANUFACTURERS: `IGT`, `Aristocrat`, `Konami`, `Scientific Games`, `Everi`
- ZONES: `North`, `South`, `East`, `West`, `VIP`, `High Limit`, `Penny`
Domain-Specific Methods¶
get_machines() -> list[dict]-- Returns the pre-generated machine configuration list.
Usage Example¶
from data_generation.generators.slot_machine_generator import SlotMachineGenerator
gen = SlotMachineGenerator(num_machines=200, seed=42)
df = gen.generate(10_000)
gen.to_parquet(df, "output/slot_telemetry.parquet", partition_cols=["zone"])
PlayerGenerator¶
Module: data_generation/generators/player_generator.py Purpose: Generates player profiles with loyalty tiers and PII handling.
Constructor¶
PlayerGenerator(
seed: int | None = None,
include_pii: bool = False,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `include_pii` | bool | False | If True, include raw PII; if False, mask/hash |
Record Schema Fields¶
player_id, loyalty_number, first_name, last_name, email, phone, date_of_birth, ssn_hash, ssn_masked, address, city, state, zip_code, country, loyalty_tier, points_balance, lifetime_points, tier_credits, enrollment_date, last_visit_date, total_visits, total_theo, total_actual_win_loss, average_daily_theo, preferred_game, communication_preference, marketing_opt_in, marketing_channel, host_assigned, vip_flag, self_excluded, account_status
Valid Enums¶
- LOYALTY_TIERS: `Bronze` (40%), `Silver` (30%), `Gold` (18%), `Platinum` (9%), `Diamond` (3%)
- PREFERRED_GAMES: `Video Slots`, `Blackjack`, `Craps`, `Roulette`, `Poker`, `Baccarat`, `Video Poker`
- ACCOUNT_STATUS: `Active` (85%), `Inactive` (10%), `Suspended` (3%), `Closed` (2%)
Domain-Specific Methods¶
def generate_with_history(
    num_players: int,
    history_days: int = 30,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Generate players with historical visit data.

    Returns: (players_df, visits_df)
    """
Usage Example¶
gen = PlayerGenerator(seed=42, include_pii=False)
players_df, visits_df = gen.generate_with_history(500, history_days=60)
FinancialGenerator¶
Module: data_generation/generators/financial_generator.py Purpose: Generates cage and financial transaction data with CTR and SAR flagging.
Constructor¶
FinancialGenerator(
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
No additional parameters beyond BaseGenerator.
Transaction Types (14 types, weighted)¶
CASH_IN (20%), CASH_OUT (18%), CHIP_PURCHASE (15%), CHIP_REDEMPTION (12%), WIRE_TRANSFER_IN (3%), WIRE_TRANSFER_OUT (2%), CHECK_CASHING (8%), MARKER_ISSUE (5%), MARKER_PAYMENT (5%), FRONT_MONEY_DEPOSIT (4%), FRONT_MONEY_WITHDRAWAL (3%), JACKPOT_PAYOUT (3%), SAFEKEEPING_DEPOSIT (1%), SAFEKEEPING_WITHDRAWAL (1%)
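The weighted mix above can be reproduced with the standard library for quick sanity checks. This sketch uses `random.choices`, which accepts relative weights, rather than the package's `weighted_choice` helper. `sample_transaction_types` is a hypothetical name.

```python
import random

# Percent weights copied from the list above; they sum to 100.
TRANSACTION_WEIGHTS = {
    "CASH_IN": 20, "CASH_OUT": 18, "CHIP_PURCHASE": 15, "CHIP_REDEMPTION": 12,
    "WIRE_TRANSFER_IN": 3, "WIRE_TRANSFER_OUT": 2, "CHECK_CASHING": 8,
    "MARKER_ISSUE": 5, "MARKER_PAYMENT": 5, "FRONT_MONEY_DEPOSIT": 4,
    "FRONT_MONEY_WITHDRAWAL": 3, "JACKPOT_PAYOUT": 3,
    "SAFEKEEPING_DEPOSIT": 1, "SAFEKEEPING_WITHDRAWAL": 1,
}


def sample_transaction_types(n: int, seed=None) -> list:
    """Draw n transaction types with the documented weights (illustrative only)."""
    rng = random.Random(seed)  # a local RNG keeps the draw reproducible per seed
    return rng.choices(
        list(TRANSACTION_WEIGHTS),
        weights=list(TRANSACTION_WEIGHTS.values()),
        k=n,
    )
```

Comparing the observed frequencies from a generated DataFrame against these weights is a simple way to confirm a run looks plausible.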
Record Schema Fields¶
transaction_id, transaction_type, transaction_timestamp, cage_location, cashier_id, supervisor_id, player_id, amount, currency, payment_method, check_number, wire_reference, marker_number, id_type, id_number_hash, ctr_required, ctr_filed, ctr_reference, suspicious_activity_flag, notes, shift, business_date
Important: The `ctr_required` flag is automatically set to `True` for transactions >= $10,000. SAR structuring detection flags amounts in the $8,000-$9,999 range at a 10% rate.
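The two flagging rules can be expressed in a few lines. This is a sketch under stated assumptions, not the generator's code: `flag_transaction` is a hypothetical name, and the structuring band is treated as every amount from $8,000 up to (but not including) $10,000.

```python
from __future__ import annotations

import random


def flag_transaction(amount: float, rng: random.Random) -> dict[str, bool]:
    """Apply the documented CTR and structuring rules to one amount."""
    ctr_required = amount >= 10_000
    # Assumption: the "$8,000-$9,999" band covers everything below the CTR threshold.
    in_structuring_band = 8_000 <= amount < 10_000
    suspicious = in_structuring_band and rng.random() < 0.10  # documented 10% rate
    return {
        "ctr_required": ctr_required,
        "suspicious_activity_flag": suspicious,
    }
```

Passing the random source in explicitly keeps the 10% draw reproducible when a seeded `random.Random` is supplied.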
ComplianceGenerator¶
Module: data_generation/generators/compliance_generator.py Purpose: Generates CTR, SAR, and W-2G regulatory filings.
Constructor¶
ComplianceGenerator(
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
Filing Types¶
- CTR (40%) -- Currency Transaction Reports >= $10,000; filed within 15 days
- SAR (15%) -- Suspicious Activity Reports; filed within 30 days
- W2G (45%) -- Gambling Winnings (IRS Form W-2G)
W-2G Thresholds¶
| Game Type | Threshold |
|---|---|
| Slots | $1,200 |
| Video Poker | $1,200 |
| Keno | $1,500 |
| Bingo | $1,200 |
| Poker Tournament | $5,000 |
| Table Games | $600 |
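When validating generated filings, the threshold table can be encoded as a lookup. In this sketch `requires_w2g` is a hypothetical helper, and treating a win exactly at the threshold as reportable is an assumption about the generator's behavior.

```python
# Thresholds in USD, copied from the table above.
W2G_THRESHOLDS = {
    "Slots": 1_200,
    "Video Poker": 1_200,
    "Keno": 1_500,
    "Bingo": 1_200,
    "Poker Tournament": 5_000,
    "Table Games": 600,
}


def requires_w2g(game_type: str, winnings: float) -> bool:
    """Return True when winnings meet the threshold for the game type."""
    # Assumption: the comparison is inclusive (>=).
    return winnings >= W2G_THRESHOLDS[game_type]
```

An unknown `game_type` raises `KeyError` here, which surfaces unexpected values instead of silently skipping them.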
SAR Categories¶
Structuring, Unusual Transaction Pattern, Third Party Activity, Identity Concerns, Employee Involvement, Counterfeit Currency, Wire Transfer Anomaly, Chip Walking, Credit Abuse, Other Suspicious Activity
Domain-Specific Methods¶
def generate_structuring_pattern(
    player_id: str,
    num_transactions: int = 5,
    target_total: float = 25000,
) -> list[dict[str, Any]]:
    """Generate a structuring pattern for SAR detection testing.

    Each transaction stays just under the $10K threshold.
    """
SecurityGenerator¶
Module: data_generation/generators/security_generator.py Purpose: Generates security and surveillance events including access control, camera alerts, and incident reports.
Constructor¶
SecurityGenerator(
num_employees: int = 500,
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_employees` | int | 500 | Number of employees for badge events |
Event Types (20 types)¶
BADGE_SWIPE (20%), DOOR_ENTRY (15%), ACCESS_GRANTED (10%), ACCESS_DENIED (5%), CAMERA_ALERT (8%), MOTION_DETECTED (6%), CAMERA_OBSTRUCTION (2%), EXCLUSION_CHECK (3%), EXCLUSION_VIOLATION (1%), INCIDENT_REPORT (5%), ALTERCATION (2%), MEDICAL_EMERGENCY (2%), THREAT_DETECTED (1%), WEAPON_DETECTED (1%), TRESPASS (2%), UNAUTHORIZED_ACCESS (3%), SUSPICIOUS_ACTIVITY (4%), PATRON_COMPLAINT (3%), ESCORT_REQUEST (2%), SECURITY_PATROL (5%)
Severity Levels¶
CRITICAL (WEAPON_DETECTED, THREAT_DETECTED, EXCLUSION_VIOLATION), HIGH (ALTERCATION, MEDICAL_EMERGENCY, TRESPASS, UNAUTHORIZED_ACCESS), MEDIUM (INCIDENT_REPORT, SUSPICIOUS_ACTIVITY, ACCESS_DENIED, CAMERA_OBSTRUCTION), LOW (all others)
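The severity mapping reads naturally as a lookup with a `LOW` default. The sketch below mirrors the grouping above. `severity_for` is a hypothetical name and not necessarily how `SecurityGenerator` stores the mapping.

```python
# Event types grouped by severity, copied from the list above.
_EVENTS_BY_SEVERITY = {
    "CRITICAL": ("WEAPON_DETECTED", "THREAT_DETECTED", "EXCLUSION_VIOLATION"),
    "HIGH": ("ALTERCATION", "MEDICAL_EMERGENCY", "TRESPASS", "UNAUTHORIZED_ACCESS"),
    "MEDIUM": ("INCIDENT_REPORT", "SUSPICIOUS_ACTIVITY", "ACCESS_DENIED", "CAMERA_OBSTRUCTION"),
}

# Invert into event_type -> severity for constant-time lookup.
SEVERITY_BY_EVENT = {
    event: severity
    for severity, events in _EVENTS_BY_SEVERITY.items()
    for event in events
}


def severity_for(event_type: str) -> str:
    """Return the severity for an event type; anything unlisted is LOW."""
    return SEVERITY_BY_EVENT.get(event_type, "LOW")
```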
Zones (14)¶
Main Floor, Cage, Count Room, Vault, Surveillance Room, Server Room, Executive Offices, Employee Entrance, Loading Dock, Parking Garage, Hotel Lobby, Restaurant, Bar, Retail
TableGamesGenerator¶
Module: data_generation/generators/table_games_generator.py Purpose: Generates table game transactions, player ratings, and dealer assignments.
Constructor¶
TableGamesGenerator(
num_tables: int = 100,
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_tables` | int | 100 | Number of tables to simulate |
Game Types and House Edges¶
| Game Type | Weight | House Edge |
|---|---|---|
| Blackjack | 35% | 0.5% |
| Craps | 20% | 1.4% |
| Roulette | 15% | 5.3% |
| Baccarat | 15% | 1.06% |
| Poker | 10% | 2.5% (rake) |
| Pai Gow | 5% | 2.6% |
Event Types (10 types)¶
BUY_IN (25%), CASH_OUT (20%), MARKER_ISSUED (5%), MARKER_PAID (5%), FILL (8%), CREDIT (7%), RATING_START (12%), RATING_END (12%), SHIFT_START (3%), SHIFT_END (3%)
# Federal Generators¶
Seven generators covering federal agency data: USDA, SBA, NOAA, EPA, DOI, Tribal Healthcare (IHS), and DOT/FAA. Federal generators use a domain parameter on generate_record() to select between sub-domains within the same agency.
USDAGenerator¶
Module: data_generation/generators/federal/usda_generator.py Domains: crop_production (default), food_safety
Constructor¶
USDAGenerator(
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
generate_record(domain="crop_production")¶
Domain crop_production -- NASS QuickStats-style records.
Key fields: record_id, commodity, year, state_fips, state_name, county_fips, county_name, statisticcat_desc, unit_desc, value, cv_percent, source_desc, agg_level_desc, domain_desc, reference_period_desc, load_time
Commodities: CORN (28%), SOYBEANS (24%), WHEAT (16%), COTTON (7%), RICE (5%), BARLEY (5%), OATS (4%), SORGHUM (4%), HAY (4%), POTATOES (3%)
Domain food_safety -- FSIS recall records.
Key fields: recall_id, recall_number, recall_date, product_type, recall_class, reason, risk_level, company_name, establishment_number, city, state, pounds_recalled, distribution, status, load_time
generate_batch(count=1000, domain="crop_production") -> pd.DataFrame¶
SBAGenerator¶
Module: data_generation/generators/federal/sba_generator.py Domains: ppp, 7a, disaster, sbir
Constructor¶
SBAGenerator(
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
generate_record(domain="ppp")¶
Key fields: loan_id, program_type, loan_amount, approval_date, borrower_name, borrower_city, borrower_state, borrower_zip, naics_code, naics_description, jobs_retained, lender_name, sba_office, loan_status, forgiveness_amount, forgiveness_date, term_months, interest_rate, rural_urban, business_type, load_time
| Domain | Program Type | Interest Rate | Term Range | Loan Amount Range |
|---|---|---|---|---|
| `ppp` | PPP | 1.0% fixed | 24 or 60 months | $20K - $10M |
| `7a` | 7A | 5.5% - 8.0% | 60-300 months | $5K - $5M |
| `disaster` | DISASTER | 2.0% - 4.0% | 360 months | $1K - $2M |
| `sbir` | SBIR | 0.0% | 12-36 months | $50K - $1.5M |
Note: The PPP domain includes the `forgiveness_amount` and `forgiveness_date` fields. Approximately 88% of PPP loans receive full forgiveness.
generate_batch(count=1000, domain="ppp") -> list[dict]¶
NOAAGenerator¶
Module: data_generation/generators/federal/noaa_generator.py Domains: weather (default), storm
Constructor¶
generate_record(domain="weather")¶
Domain weather -- Surface station observations from 18 real US ASOS/AWOS stations.
Key fields: observation_id, station_id, station_name, timestamp, latitude, longitude, elevation_m, parameter, value, unit, quality_flag, data_source, report_type, load_time
Parameters: TEMPERATURE (F), DEWPOINT (F), HUMIDITY (PCT), WIND_SPEED (MPH), WIND_DIRECTION (DEG), PRESSURE (IN_HG), VISIBILITY (MI), PRECIPITATION (IN), CLOUD_COVER (PCT)
Domain storm -- Storm Events Database records.
Key fields: event_id, episode_id, event_type, state, state_fips, county_fips, begin_date, end_date, injuries_direct, injuries_indirect, deaths_direct, deaths_indirect, damage_property, damage_crops, magnitude, magnitude_type, begin_lat, begin_lon, end_lat, end_lon, tor_f_scale, source, flood_cause, load_time
Storm event types: THUNDERSTORM_WIND (30%), HAIL (25%), FLASH_FLOOD (15%), TORNADO (10%), and 10 more.
Overridden generate(num_records, show_progress=True, domain="weather") -> pd.DataFrame¶
Unlike the base class, this generate() override accepts a domain parameter.
EPAGenerator¶
Module: data_generation/generators/federal/epa_generator.py Domains: air_quality (default), water_quality
Constructor¶
EPAGenerator(
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
domain: str = "air_quality",
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `domain` | str | "air_quality" | Default domain for generate_record() calls |
generate_record(domain=None)¶
If domain is None, uses the default set at construction.
Domain air_quality -- AQS monitoring site measurements.
Key fields: record_id, site_id, site_name, parameter, parameter_code, date_local, time_local, aqi_value, aqi_category, concentration, units, sample_duration, latitude, longitude, state_code, county_code, state_name, county_name, cbsa_name, method_code, load_time
Air quality parameters: PM2.5, PM10, OZONE, CO, SO2, NO2, LEAD
AQI categories: GOOD (0-50, ~55%), MODERATE (51-100, ~30%), UNHEALTHY_SENSITIVE (101-150, ~10%), UNHEALTHY (151-200, ~4%), VERY_UNHEALTHY/HAZARDOUS (201-500, ~1%)
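The category bands translate directly into a range lookup. This sketch is illustrative. The reference groups 201-500 under one weight, so splitting that band at 300 follows the EPA's published AQI breakpoints rather than anything stated here, and `aqi_category` is a hypothetical name.

```python
# (inclusive upper bound, label) pairs in ascending order.
AQI_BANDS = (
    (50, "GOOD"),
    (100, "MODERATE"),
    (150, "UNHEALTHY_SENSITIVE"),
    (200, "UNHEALTHY"),
    (300, "VERY_UNHEALTHY"),  # assumption: split point not given in this reference
    (500, "HAZARDOUS"),
)


def aqi_category(aqi_value: int) -> str:
    """Map an AQI value (0-500) to its category label."""
    if not 0 <= aqi_value <= 500:
        raise ValueError("aqi_value must be between 0 and 500")
    for upper, label in AQI_BANDS:
        if aqi_value <= upper:
            return label
    raise AssertionError("unreachable: bands cover 0-500")
```

A check like this is useful for asserting that `aqi_value` and `aqi_category` agree in every generated row.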
Domain water_quality -- SDWIS public water system sample results.
Key fields: record_id, system_id, system_name, system_type, sample_date, contaminant, contaminant_code, result_value, unit, mcl, mcl_violation, violation_type, state_code, county_served, population_served, source_type, primacy_agency, load_time
Contaminants: Arsenic, Lead, Nitrate, Fluoride, Copper, Coliform, Turbidity, Chlorine Residual, Trihalomethanes, Radium-226
Important: Approximately 5% of water quality records will have MCL violations (`mcl_violation=True`).
DOIGenerator¶
Module: data_generation/generators/federal/doi_generator.py Domains: earthquake (default), land_use
Constructor¶
DOIGenerator(
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
generate_record(domain="earthquake")¶
Domain earthquake -- USGS ComCat-style seismic event records.
Key fields: event_id, usgs_id, time, latitude, longitude, depth_km, magnitude, mag_type, place, event_type, status, tsunami, significance, felt, cdi, mmi, alert, net, nst, gap, rms, url, load_time
Magnitude distribution (Gutenberg-Richter): M1-3 (60%), M3-5 (25%), M5-6 (10%), M6-7 (4%), M7+ (1%)
Seismic zones: US West Coast (28%), Alaska (22%), Japan (12%), Central America (9%), South America (8%), and 5 others.
Domain land_use -- BLM/NPS/FWS/USFS parcel management records.
Key fields: parcel_id, blm_serial_number, managing_agency, state, county, land_type, total_acres, latitude, longitude, designation, designation_date, permit_type, permit_holder, annual_revenue, environmental_assessment, protected_species_present, fire_risk_level, last_inspection_date, load_time
Managing agencies: BLM (35%), USFS (25%), NPS (15%), FWS (10%), BOR (8%), DOD (5%), OTHER (2%)
generate_batch(count=1000, domain="earthquake") -> pd.DataFrame¶
TribalHealthcareGenerator¶
Module: data_generation/generators/federal/tribal_healthcare_generator.py Domains: Single domain (IHS healthcare encounters)
Constructor¶
TribalHealthcareGenerator(
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
generate_record() -> dict[str, Any]¶
No domain parameter. Generates IHS encounter records across 20 facilities, 12 area offices, and 30 tribal affiliations.
Key fields: record_id, patient_id, facility_id, facility_name, encounter_type, encounter_date, icd10_code, icd10_description, cpt_code, cpt_description, provider_id, provider_type, tribal_affiliation, service_unit, area_office, age_group, gender, insurance_type, medication_name, medication_ndc, lab_test_name, lab_result_value, lab_result_unit, lab_abnormal_flag, hipaa_consent, phi_masked, load_time
Encounter Types¶
outpatient (35%), inpatient (8%), emergency (10%), telehealth (7%), dental (12%), behavioral_health (8%), pharmacy (12%), laboratory (8%)
Diagnosis Weighting (ICD-10, reflecting Native American health disparities)¶
| Category | Weight | Example Codes |
|---|---|---|
| Diabetes | 17% | E11.9, E11.65, E11.22 |
| Respiratory | 14% | J06.9, J45.20, J45.40 |
| Cardiovascular | 12% | I10, E78.5, I25.10 |
| Behavioral Health | 12% | F32.1, F10.20, F10.10 |
| Metabolic/Obesity | 10% | E66.01, E66.9 |
| Musculoskeletal | 10% | M54.5, M54.2, M25.50 |
| Gastrointestinal | 8% | K21.0, K58.9 |
| Genitourinary | 7% | N39.0 |
| Dental | 6% | K02.9 |
| Pregnancy | 4% | O24.11 |
Note: All records have `hipaa_consent=True` and `phi_masked=True` to reflect HIPAA-compliant de-identified data.
generate_batch(count=1000) -> pd.DataFrame¶
DOTFAAGenerator¶
Module: data_generation/generators/federal/dot_faa_generator.py Domains: flight_operations (default), safety_incident, traffic_statistics, infrastructure
Constructor¶
DOTFAAGenerator(
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
generate_record(domain="flight_operations")¶
Domain flight_operations -- BTS On-Time Performance records with 20 carriers and 30 airports.
Key fields: record_id, data_domain, carrier_code, carrier_name, flight_number, origin_airport, destination_airport, departure_date, faa_region, report_year, report_month, scheduled_departure, actual_departure, delay_minutes, delay_cause, cancelled, diverted, aircraft_type, tail_number, passengers, airport_category, runway_id, visibility_miles, wind_speed_knots, load_time
Delay causes: none (65%), carrier (15%), weather (10%), nas (5%), security (3%), late_aircraft (2%). Cancellation rate: 5%. Diversion rate: 1%.
Domain safety_incident -- FAA safety event records.
Additional fields: incident_type, incident_severity
Incident types: bird_strike (30%), turbulence (25%), mechanical (20%), runway_incursion (8%), fuel_issue (6%), medical (5%), security_threat (3%), near_miss (3%)
Domain traffic_statistics -- T-100 aggregate traffic records.
Domain infrastructure -- Airport and runway records.
generate_batch(count=1000, domain="flight_operations") -> list[dict]¶
# Analytics Generators¶
Three generators producing event-level data for video security analytics, people movement tracking, and geolocation services within a casino resort environment.
VideoAnalyticsGenerator¶
Module: data_generation/generators/analytics/video_analytics_generator.py Purpose: Generates video security analytics events from 50 cameras across 14 casino zones.
Constructor¶
VideoAnalyticsGenerator(
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
Event Types¶
| Event Type | Weight | Alert Level |
|---|---|---|
| `object_detection` | 35% | INFO |
| `zone_crossing` | 20% | INFO |
| `crowd_density` | 12% | WARNING |
| `anomaly` | 10% | CRITICAL |
| `face_match` | 8% | WARNING |
| `loitering` | 7% | WARNING |
| `tailgating` | 5% | WARNING |
| `abandoned_object` | 3% | CRITICAL |
Record Schema Fields¶
event_id, camera_id, camera_location, event_type, timestamp, confidence_score, object_class, object_count, bounding_box, track_id, zone_from, zone_to, dwell_time_seconds, anomaly_type, alert_level, frame_number, video_resolution, fps, model_name, model_version, metadata, load_time
Object classes: person (40%), vehicle (10%), bag (12%), chip_tray (8%), cash_bundle (6%), card (8%), phone (7%), weapon (2%), unknown (7%)
Model names: YOLOv8 (35%), DeepSORT (25%), RetinaNet (20%), SSD-MobileNet (10%), FairMOT (10%)
generate_batch(count=1000) -> list[dict]¶
PeopleMovementGenerator¶
Module: data_generation/generators/analytics/people_movement_generator.py Purpose: Generates foot traffic sensor readings from 80 sensors across 29 casino zones on 3 floors.
Constructor¶
PeopleMovementGenerator(
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
Sensor Types¶
wifi_probe (30%), ble_beacon (25%), camera_count (20%), infrared (10%), pressure_mat (10%), lidar (5%)
Directions¶
entering (30%), exiting (25%), stationary (25%), passing_through (20%)
Record Schema Fields¶
event_id, sensor_id, sensor_type, zone_id, zone_name, timestamp, person_count, direction, dwell_time_seconds, velocity_mps, x_coordinate, y_coordinate, floor_level, heat_map_cell, occupancy_percentage, queue_detected, queue_length, queue_wait_minutes, device_mac_hash, signal_strength_dbm, battery_level, calibration_date, load_time
Note: Queue detection activates automatically when `occupancy_percentage > 70%` in queue-eligible zones (Cage Windows, Buffet, Steakhouse, Hotel Check-In, Main Bar). Wi-Fi probe sensors include a SHA-256 MAC address hash. BLE beacon sensors include battery level.
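The queue rule combines a zone check with an occupancy threshold. A minimal sketch follows, with `queue_detected` as a hypothetical function name and the zone names taken from the note above.

```python
QUEUE_ELIGIBLE_ZONES = frozenset(
    {"Cage Windows", "Buffet", "Steakhouse", "Hotel Check-In", "Main Bar"}
)


def queue_detected(zone_name: str, occupancy_percentage: float) -> bool:
    """True only for eligible zones with occupancy strictly above 70%."""
    return zone_name in QUEUE_ELIGIBLE_ZONES and occupancy_percentage > 70
```

Records from other zones should therefore never carry `queue_detected=True`, whatever their occupancy.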
generate_batch(count=1000) -> list[dict]¶
GeolocationGenerator¶
Module: data_generation/generators/analytics/geolocation_generator.py Purpose: Generates GPS/indoor positioning events from a 200-device fleet across a Las Vegas casino resort campus.
Constructor¶
GeolocationGenerator(
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
Device Types¶
patron_app (40%), employee_badge (25%), asset_tag (15%), vehicle_gps (8%), shuttle_tracker (7%), valet_tag (5%)
Source Systems and Accuracy¶
| Source System | Weight | Accuracy Range |
|---|---|---|
| `gps` | 35% | 3-15 m |
| `wifi_triangulation` | 25% | 5-30 m |
| `ble_trilateration` | 20% | 1-5 m |
| `uwb` | 10% | 0.1-1.0 m |
| `hybrid` | 10% | 1-10 m |
Record Schema Fields¶
event_id, device_id, device_type, timestamp, latitude, longitude, altitude_meters, accuracy_meters, speed_mps, heading_degrees, h3_index, geofence_id, geofence_name, geofence_event, geofence_dwell_seconds, poi_name, poi_distance_meters, floor_level, indoor_zone, proximity_trigger, source_system, battery_level, load_time
Geofence events: enter (35%), exit (30%), dwell (35%). Geofence interaction rate: ~40% of events.
Proximity triggers (patron_app only, ~15%): marketing_push, loyalty_offer, vip_greeting, safety_alert, staff_dispatch
generate_batch(count=1000) -> list[dict]¶
# Streaming Generators¶
Three generators designed for real-time and near-real-time event streaming scenarios: Event Hub slot telemetry, multi-source CDC events, and IoT device fleet telemetry.
EventHubProducer¶
Module: data_generation/generators/streaming/event_hub_producer.py Purpose: Streams slot machine telemetry events to Azure Event Hub or stdout.
Note: This generator does NOT inherit from `BaseGenerator` in the typical way. It wraps a `SlotMachineGenerator` internally.
Constructor¶
EventHubProducer(
connection_string: str | None = None,
eventhub_name: str | None = None,
events_per_second: float = 10,
seed: int | None = None,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `connection_string` | str \| None | None | Azure Event Hub connection string |
| `eventhub_name` | str \| None | None | Event Hub name |
| `events_per_second` | float | 10 | Target event rate |
| `seed` | int \| None | None | Random seed passed to inner generator |
Modes: If connection_string is provided and azure-eventhub is installed, events are sent to Event Hub. Otherwise, events are printed to stdout as JSON lines.
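The mode selection described above amounts to two checks: a connection string is present, and the SDK can be imported. The sketch below shows one way to express that decision. It is not the producer's actual code, and `select_output_mode` and `eventhub_sdk_available` are hypothetical names.

```python
from __future__ import annotations

import importlib.util


def eventhub_sdk_available() -> bool:
    """Report whether the azure-eventhub package can be imported."""
    try:
        return importlib.util.find_spec("azure.eventhub") is not None
    except ModuleNotFoundError:
        # Raised when the parent "azure" package is missing entirely.
        return False


def select_output_mode(connection_string: str | None) -> str:
    """Return "eventhub" when sending is possible, otherwise "stdout"."""
    if connection_string and eventhub_sdk_available():
        return "eventhub"
    return "stdout"
```

Falling back to stdout lets the same command run locally without Azure credentials.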
Public Methods¶
| Method | Signature | Description |
|---|---|---|
| `generate_event` | () -> dict | Generate a single slot machine event |
| `run_sync` | (duration_seconds=None, max_events=None, callback=None) | Run synchronously with rate limiting |
| `run_async` | (duration_seconds=None, max_events=None, batch_size=100) | Run asynchronously with batched Event Hub sends |
| `stop` | () | Stop a running producer |
Properties¶
| Property | Type | Description |
|---|---|---|
| `event_count` | int | Number of events generated so far |
CLI Usage¶
python -m data_generation.generators.streaming.event_hub_producer \
--rate 50 --duration 60 --seed 42
MultiSourceSimulator¶
Module: data_generation/generators/streaming/multi_source_simulator.py Purpose: Generates CDC (Change Data Capture) events from 5 database source types for Eventstreams demos.
Constructor¶
MultiSourceSimulator(
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
generate_record(source_type="sql_server")¶
| Source Type | Enum Value | Connectors | Latency (ms) | Has LSN | Has SCN | Has Partition Key |
|---|---|---|---|---|---|---|
| `sql_server` | SQL_SERVER | DEBEZIUM, FABRIC_MIRRORING | 50-2000 | Yes | No | No |
| `azure_sql` | AZURE_SQL | FABRIC_MIRRORING, CHANGE_FEED | 10-500 | No | No | No |
| `cosmos_db` | COSMOS_DB | CHANGE_FEED, FABRIC_MIRRORING | 10-200 | No | No | Yes |
| `ibm_db2` | IBM_DB2 | INFOSPHERE_CDC | 500-5000 | No | No | No |
| `oracle` | ORACLE | GOLDEN_GATE, LOGMINER | 100-3000 | No | Yes | No |
Record Schema Fields¶
event_id, source_type, operation, timestamp, server_name, database_name, schema_name, table_name, primary_key, before_image, after_image, transaction_id, lsn, scn, partition_key, sequence_number, connector_name, latency_ms, schema_version, load_time
Operations: INSERT (40%), UPDATE (35%), DELETE (15%), READ (10%)
Note: `before_image` and `after_image` follow CDC conventions: INSERT has only `after_image`, DELETE has only `before_image`, UPDATE has both, READ has only `after_image`.
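The image rules in the note can be captured in one small function. This is a sketch of the convention only. `cdc_images` is a hypothetical helper, and representing an absent image as `None` is an assumption.

```python
from __future__ import annotations

from typing import Any

Row = dict[str, Any]


def cdc_images(
    operation: str, before: Row | None, after: Row | None
) -> tuple[Row | None, Row | None]:
    """Return (before_image, after_image) for a CDC operation."""
    op = operation.upper()
    if op in ("INSERT", "READ"):
        return None, after      # only the new or current row exists
    if op == "DELETE":
        return before, None     # only the removed row exists
    if op == "UPDATE":
        return before, after    # both sides of the change
    raise ValueError(f"Unknown operation '{operation}'")
```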
Additional Methods¶
def generate_batch(count=1000, source_type="sql_server") -> list[dict]:
    """Generate events from a single source type."""

def generate_mixed_batch(count=1000) -> list[dict]:
    """Generate events drawn randomly from all source types."""
IoTDeviceSimulator¶
Module: data_generation/generators/streaming/iot_device_simulator.py Purpose: Simulates telemetry from a heterogeneous fleet of casino IoT devices via Azure IoT Hub conventions.
Constructor¶
IoTDeviceSimulator(
num_devices: int = 100,
seed: int | None = None,
start_date: datetime | None = None,
end_date: datetime | None = None,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_devices` | int | 100 | Approximate fleet size (proportionally distributed) |
Device Types (7)¶
| Device Type | Default Count | Protocol | Telemetry Interval |
|---|---|---|---|
| `slot_machine` | 500 | MQTT | 5 sec |
| `table_sensor` | 80 | MQTT | 10 sec |
| `hvac_sensor` | 40 | AMQP | 60 sec |
| `door_sensor` | 60 | MQTT | 1 sec |
| `camera` | 120 | HTTPS | 30 sec |
| `beacon` | 200 | MQTT | 3 sec |
| `environmental` | 50 | MQTT | 15 sec |
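Since `num_devices` is described as approximate and proportionally distributed, one plausible reading is that the default counts act as ratios. The sketch below scales them to a requested fleet size. The rounding rule and the minimum of one device per type are assumptions, and `scale_fleet` is a hypothetical name rather than a package function.

```python
# Default counts copied from the table above; used here only as ratios.
DEFAULT_COUNTS = {
    "slot_machine": 500,
    "table_sensor": 80,
    "hvac_sensor": 40,
    "door_sensor": 60,
    "camera": 120,
    "beacon": 200,
    "environmental": 50,
}


def scale_fleet(num_devices: int) -> dict:
    """Distribute num_devices across device types in the default proportions."""
    total = sum(DEFAULT_COUNTS.values())
    # Assumption: round to nearest and keep at least one device of each type.
    return {
        device_type: max(1, round(num_devices * count / total))
        for device_type, count in DEFAULT_COUNTS.items()
    }
```

Because each type is rounded independently, the scaled total can differ from `num_devices` by a few devices, which is consistent with the "approximate" wording.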
generate_record(device_type="slot_machine")¶
Record schema fields: message_id, device_id, device_type, timestamp, protocol, hub_name, enqueued_time, correlation_id, content_type, properties, telemetry, location_zone, firmware_version, signal_strength_dbm
The telemetry field is a nested object whose structure varies by device type (e.g., slot machines include coin_in_meter, status, error_code; HVAC sensors include temperature, humidity, co2_level).
Additional Methods¶
| Method | Signature | Description |
|---|---|---|
| `generate_batch` | (count=1000, device_type="slot_machine") -> list[dict] | Batch of messages for one device type |
| `generate_fleet_snapshot` | () -> list[dict] | One message per device in the fleet |
| `get_fleet` | () -> dict[str, list[dict]] | Pre-generated device registry |
| `fleet_summary` | () -> dict[str, int] | Device count per type |
# Extension Guide¶
To create a custom generator, inherit from BaseGenerator and implement generate_record().
Minimal Template¶
"""
Custom Generator
================
Generates synthetic data for [your domain].
"""
from datetime import datetime
from typing import Any

import numpy as np

from data_generation.generators.base_generator import BaseGenerator


class CustomGenerator(BaseGenerator):
    """Generate synthetic [domain] data."""

    def __init__(
        self,
        seed: int | None = None,
        start_date: datetime | None = None,
        end_date: datetime | None = None,
    ):
        super().__init__(seed=seed, start_date=start_date, end_date=end_date)
        self._schema = {
            "record_id": "string",
            "timestamp": "datetime",
            "value": "float",
            "category": "string",
        }

    def generate_record(self) -> dict[str, Any]:
        """Generate a single record."""
        record = {
            "record_id": self.generate_uuid(),
            "timestamp": self.random_datetime().isoformat(),
            "value": round(float(np.random.uniform(0, 100)), 2),
            "category": self.weighted_choice(
                ["A", "B", "C"],
                [0.5, 0.3, 0.2],
            ),
        }
        return self.add_metadata_columns(record)
Implementation Checklist¶
- Inherit from `BaseGenerator`
- Call `super().__init__()` in your constructor
- Define `self._schema` with field names and types
- Implement `generate_record()` returning a `dict[str, Any]`
- Call `self.add_metadata_columns(record)` before returning
- Use `self.weighted_choice()` for realistic distributions
- Use `self.random_datetime()` for timestamp generation
- Use `self.generate_uuid()` for unique identifiers
- Use `self.hash_value()` / `self.mask_ssn()` for PII protection
- (Optional) Add a `generate_batch()` method if you need domain-specific batch logic
Adding Domain Support¶
For multi-domain generators (like the federal generators), override generate_record() with a domain parameter:
def generate_record(self, domain: str = "default") -> dict[str, Any]:
    if domain == "domain_a":
        return self._generate_domain_a_record()
    elif domain == "domain_b":
        return self._generate_domain_b_record()
    raise ValueError(f"Unknown domain '{domain}'")
# Return Value Schemas¶
Each casino generator has a corresponding JSON schema file in data_generation/schemas/. Federal, analytics, and streaming generators define their schemas inline via the _schema property.
Casino Generator Schema Mapping¶
| Generator | Schema File |
|---|---|
| `SlotMachineGenerator` | data_generation/schemas/slot_telemetry_schema.json |
| `PlayerGenerator` | data_generation/schemas/player_profile_schema.json |
| `FinancialGenerator` | data_generation/schemas/financial_transaction_schema.json |
| `ComplianceGenerator` | data_generation/schemas/compliance_filing_schema.json |
| `SecurityGenerator` | data_generation/schemas/security_events_schema.json |
| `TableGamesGenerator` | data_generation/schemas/table_games_schema.json |
Federal / Analytics / Streaming Schema Access¶
These generators define schemas programmatically. Access via the schema property:
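A minimal sketch of that access pattern follows. `DemoGenerator` is a hypothetical stand-in so the snippet runs without the package installed. It assumes `schema` simply exposes the `_schema` dict, as the Extension Guide template suggests.

```python
class DemoGenerator:
    """Hypothetical stand-in for a generator that defines its schema inline."""

    def __init__(self) -> None:
        self._schema = {
            "record_id": "string",
            "timestamp": "datetime",
            "value": "float",
        }

    @property
    def schema(self) -> dict:
        # Return a copy so callers cannot mutate the generator's definition.
        return dict(self._schema)


gen = DemoGenerator()
for field, dtype in gen.schema.items():
    print(f"{field}: {dtype}")
```

With a real generator, such as an instance of `USDAGenerator`, the same `.schema` access should apply, since the property is declared on `BaseGenerator`.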
Common Metadata Columns (all generators)¶
Every record returned by any generator includes these three metadata fields appended by add_metadata_columns():
| Field | Type | Description |
|---|---|---|
| `_ingested_at` | string | ISO-8601 timestamp of generation |
| `_source` | string | Generator class name |
| `_batch_id` | string | 8-character UUID prefix for batching |
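For downstream code that needs to emulate these columns, for example in unit tests, the sketch below produces the same three fields. It is a standalone illustration, not the `BaseGenerator` method: the explicit `source` argument, the UTC timestamp, and taking the first eight hex characters of a UUID4 are all assumptions.

```python
import uuid
from datetime import datetime, timezone


def add_metadata_columns(record: dict, source: str) -> dict:
    """Return a copy of record with the three documented metadata fields."""
    enriched = dict(record)  # leave the caller's dict untouched
    enriched["_ingested_at"] = datetime.now(timezone.utc).isoformat()  # assumption: UTC
    enriched["_source"] = source
    enriched["_batch_id"] = uuid.uuid4().hex[:8]  # assumption: first 8 hex characters
    return enriched
```

Filtering on the leading underscore is a convenient way to separate these metadata columns from domain fields when comparing records.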
Last updated: 2026-03-11 | Generator API Reference v1.0.0